I'm in an undergraduate class where we're studying formal grammars right now. I asked my teacher if there was any known set of rules for creating context free grammars that
1) Was guaranteed to produce an unambiguous grammar.
2) Allows for the creation of any possible unambiguous grammar.
I am well aware that determining whether a grammar is ambiguous is undecidable. I'm not sure if the above idea is reducible to that; after fiddling around with it a bit, I couldn't think of a way to make such a reduction. That said, I'm really no expert at reductions or grammars. I tried googling for a while, but I only found pages about undecidability. Does anyone know?
I think you know enough to work this out to a satisfactory conclusion on your own.
Consider a set of rules which I claim will satisfy your constraints: they will accept only unambiguous grammars, and they will accept all unambiguous grammars. That is, a grammar G is unambiguous if, and only if, it follows the rules.
Now, to be useful, my set of rules must be such that it's possible to tell by looking at a grammar whether it obeys the rules or not. That is, there must be an effective procedure for determining the truth, for any grammar G, of the proposition "G follows the rules."
You may be able now to see where this is headed. If you don't, give yourself a minute; it may come to you.
Assume I have such a set of rules. If my description is correct, we now have an effective procedure for determining whether a grammar follows the rules, and we know that a grammar follows the rules if and only if it is unambiguous. Together, those two claims amount to a claim of an effective procedure for determining whether a grammar is unambiguous. The undecidability result you mention says that there is no such effective procedure.
So you can know, without examination, that my set of rules does not satisfy all three constraints. Either my rules do allow some ambiguous grammars (violating your first constraint), or my rules reject some unambiguous grammar (violating your second constraint), or there is no effective procedure for determining whether a grammar obeys my rules or not (violating our intuitive understanding of "a set of rules").
[Addendum]
If we relax the constraint on the set of rules to allow rules to count as useful even if checking them does not always terminate, then I think the answer is probably 'yes', on the general principle that for any program one can formulate a theorem that is true if and only if the program terminates, and for any theorem one can construct a program that terminates if and only if the theorem is true. (This has a name, but I'm blanking on it at the moment.)
Gödel's completeness theorem establishes that there is an algorithm which will generate (given enough time) any theorem of a first-order logical system; the hitch is that it's not guaranteed to terminate, so after any given period of time we may be unable to tell whether we are dealing with a true statement that will be recognized eventually, or a false statement for which the algorithm may never terminate.
That said, I do not know that the rules in question would be anything other than a statement, in logical form, of the definition of ambiguity (or its absence) in a grammar. The simplest way to construct such a device would be to generate all possible grammars, in order of ascending length, and for each grammar undertake the proof, one step at a time, each grammar taking turns with the others. (If we work on one grammar until we have a result, we may never get to the next grammar, so we have to perform a sort of round-robin, much like the simple algorithm for generating all sentences from a grammar by generating all derivations.)
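To make the round-robin concrete, here is a minimal sketch in Python of the dovetailing idea, assuming each proof attempt is modeled as a generator that yields None while still working and finally yields its verdict; the `check` task below is a stand-in of mine, not a real ambiguity prover:

```python
_EXHAUSTED = object()

def dovetail(tasks):
    """Round-robin over possibly non-terminating tasks.

    Each task is a generator that yields None while still working and
    finally yields a verdict. No task can starve the others, because
    every live task gets exactly one step per round.
    """
    active = list(tasks)
    while active:
        survivors = []
        for task in active:
            step = next(task, _EXHAUSTED)
            if step is _EXHAUSTED:
                continue                    # task ended without a verdict
            if step is not None:
                yield step                  # task reached its verdict; retire it
            else:
                survivors.append(task)      # still working; gets another turn
        active = survivors

def check(n):
    # stand-in for "run the proof for grammar n"; takes n steps to finish
    for _ in range(n):
        yield None
    yield f"grammar {n}: proof found"

for verdict in dovetail(check(n) for n in (3, 1, 2)):
    print(verdict)                          # finishes in order 1, 2, 3
```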
I hope this helps.
I'm working on a streaming rules engine, and some of my customers have a few hundred rules they'd like to evaluate on every event that arrives at the system. The rules are pure (i.e. non-side-effecting) Boolean expressions, and they can be nested arbitrarily deeply.
Customers are creating, updating and deleting rules at runtime, and I need to detect and adapt to the population of rules dynamically. At the moment, the expression evaluation uses an interpreter over the internal AST, and I haven't started thinking about codegen yet.
As always, some of the predicates in the tree are MUCH cheaper to evaluate than others, and I've been looking for an algorithm or data structure that makes it easier to find the predicates that are cheap and that can validly be interpreted as controlling the entire expression. My mental headline for this pattern is "ANDs all the way to the root", i.e. any predicate all of whose ancestors are ANDs can be interpreted as controlling.
Despite several days of literature search, reading about ROBDDs, CNF, DNF, etc., I haven't been able to close the loop from what might be common practice in the industry to my particular use case. One thing I've found that seems related is the paper "Analysis and optimization for boolean expression indexing", but it's not clear how I could apply it without implementing the BE-Tree data structure myself, as there doesn't seem to be an open source implementation.
I keep half-jokingly mentioning to my team that we're going to need a SAT solver one of these days. :)
I guess it would probably suffice to write a recursive algorithm that traverses the tree and keeps track of whether every ancestor is an AND or an OR, but I keep getting the "surely this is a solved problem" feeling. :)
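That recursive traversal is indeed only a few lines. A sketch in Python, assuming the AST is built from nested tuples like ('and', a, b) / ('or', a, b) / ('not', a), with anything else treated as a leaf predicate (the tuple convention is mine, not part of the question):

```python
def controlling_predicates(node, all_ancestors_and=True):
    """Collect leaf predicates whose ancestors are all ANDs.

    If such a predicate evaluates to False, the whole expression is
    False, so it is safe (and profitable) to test it first.
    """
    if isinstance(node, tuple) and node[0] in ('and', 'or', 'not'):
        op, *children = node
        for child in children:
            # the flag survives only while every enclosing node is an AND
            yield from controlling_predicates(
                child, all_ancestors_and and op == 'and')
    elif all_ancestors_and:
        yield node                      # a controlling leaf predicate

expr = ('and', ('and', 'cheap_check', 'expensive_check'),
               ('or', 'x', 'y'))
print(list(controlling_predicates(expr)))
# ['cheap_check', 'expensive_check']  ('x' and 'y' sit under an OR)
```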
Edit: After talking to a couple of friends, I think I may have a sketch of a solution!
1) Transform the expressions into Conjunctive Normal Form, in which, by definition, every node is in a valid short-circuit position.
2) Use the Tseitin algorithm to try to avoid exponential blowups in expression size as a result of the CNF transform.
3) For each AND in the tree, sort its children in ascending order of cost (i.e. cheapest to the left; see the sketch below).
4) ???
5) Profit! ^W Eval as usual :)
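Steps 1 and 3 are easy to picture in code. A sketch, assuming each literal carries a precomputed cost estimate and predicates are looked up by name (both assumptions of mine); the Tseitin step is omitted, since the fresh variables it introduces would just show up here as ordinary literals:

```python
def eval_cnf(clauses, env):
    """Evaluate a CNF expression: an AND of ORs of literals.

    clauses: list of clauses; each clause is a list of
             (cost, name, negated) literals.
    env:     dict mapping predicate name -> zero-argument callable.
    Clauses and literals are tried cheapest-first, so short-circuiting
    skips as much expensive work as possible.
    """
    # try the clause with the cheapest potential short-circuit first
    for clause in sorted(clauses, key=lambda c: min(cost for cost, _, _ in c)):
        satisfied = False
        for cost, name, negated in sorted(clause):   # cheapest literal first
            if env[name]() != negated:   # literal is true
                satisfied = True
                break                    # OR short-circuits on first true literal
        if not satisfied:
            return False                 # AND short-circuits on first false clause
    return True

clauses = [
    [(1, 'is_valid_user', False)],                            # cheap, must hold
    [(50, 'regex_match', False), (400, 'geo_lookup', True)],  # cheap literal first
]
env = {'is_valid_user': lambda: True,
       'regex_match':   lambda: True,
       'geo_lookup':    lambda: False}
print(eval_cnf(clauses, env))   # True; geo_lookup is never called
```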
You should seriously consider compiling the rules (and the predicates). An interpreter is 10-50x slower than machine code for the same thing. Compiling is clearly a good idea if the rule set doesn't change very often. It's even a good idea if the rules can change dynamically, because in practice they still don't change very fast, although now your rule compiler has to be online. That just makes for a bigger application program, and memory isn't much of an issue anymore.
Boolean expression evaluation using individual machine instructions is even better. Any complex boolean equation can be compiled into branchless sequences of individual machine instructions over the leaf values. No branches, no cache misses; stuff runs pretty damn fast. Now, if you have expensive predicates, you probably want to compile code with branches that skip subtrees which don't affect the result of the expression, if those subtrees contain expensive predicates.
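To illustrate the branchless idea without dropping into assembly, here is a sketch in Python that flattens a boolean AST into a linear postfix program of bitwise AND/OR/NOT steps, which mirrors the structure a machine-code version would have; the tuple AST convention is assumed:

```python
def compile_branchless(node, program):
    """Flatten a boolean AST into a linear, branch-free program.

    Each step operates on a value stack; AND/OR become the bitwise
    &, | so no conditional branches are needed anywhere.
    """
    if isinstance(node, tuple):
        op, *children = node
        for child in children:
            compile_branchless(child, program)
        program.append(op)                  # postfix: operands, then operator
    else:
        program.append(('load', node))      # leaf predicate value
    return program

def run(program, env):
    stack = []
    for step in program:
        if isinstance(step, tuple):         # ('load', name)
            stack.append(env[step[1]])
        elif step == 'and':
            b, a = stack.pop(), stack.pop(); stack.append(a & b)
        elif step == 'or':
            b, a = stack.pop(), stack.pop(); stack.append(a | b)
        elif step == 'not':
            stack.append(1 - stack.pop())
    return stack[0]

prog = compile_branchless(('and', 'p', ('or', 'q', ('not', 'r'))), [])
print(run(prog, {'p': 1, 'q': 0, 'r': 0}))   # 1
```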
Within reason, you can generate any equivalent form (I'd run screaming into the night over the idea of using CNF, because it always blows up on you). What you really want is the shortest boolean equation (deepest expression tree) equivalent to what the clients provided, because that will take the fewest machine instructions to execute. This may sound crazy, but you might consider exhaustive search code generation, i.e., literally try every combination that has a chance of working, especially if the number of operators in the equation is relatively small. The VLSI world has been working hard on doing various optimizations when synthesizing boolean equations into gates. You should look into the Espresso heuristic boolean logic optimizer (https://en.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer).
One thing that might drive your expression evaluation is literally the cost of the predicates. If I have the formula A AND B, and I know that A is expensive to evaluate and usually returns true, then clearly I want to evaluate B AND A instead.
You should consider common subexpression elimination, so that any common subterm is only computed once. This is especially important when one has expensive predicates; you never want to evaluate the same expensive predicate twice.
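A cheap way to get that guarantee at the leaf level is to memoize predicate results per event. A sketch, where the name-to-function dictionary is an assumption of mine:

```python
def make_memoized_env(predicates, event):
    """Wrap predicate lookup so each predicate runs at most once per event.

    predicates: dict name -> function(event) -> bool (names assumed).
    Rules that share a subterm then share its single evaluation.
    """
    cache = {}
    def lookup(name):
        if name not in cache:
            cache[name] = predicates[name](event)   # first and only evaluation
        return cache[name]
    return lookup

# usage: build a fresh env per incoming event, share it across all rules
# env = make_memoized_env({'expensive': slow_check}, event)
# env('expensive'); env('expensive')   # second call is just a dict hit
```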
I implemented these tricks in a PLC emulator (these are basically machines that evaluate buckets [like hundreds of thousands] of boolean equations telling factory actuators when to move) using x86 machine instructions for AND/OR/NOT for Rockwell Automation some 20 years ago. It outran Rockwell's "premier" PLC which had custom hardware but was essentially an interpreter.
You might also consider incremental evaluation of the equations. The basic idea is not to re-evaluate all the equations over and over, but rather to re-evaluate only those equations whose input changed. Details are too long to include here, but a patent I did back then explains how to do it. See https://patents.google.com/patent/US5623401A/en?inventor=Ira+D+Baxter&oq=Ira+D+Baxter
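The flavor of the idea (not the patented design) can be sketched as a dependency map from inputs to the rules that mention them, so a changed input re-runs only its dependents; the dict-based rule representation here is my own assumption:

```python
from collections import defaultdict

class IncrementalEvaluator:
    """Re-evaluate only the rules whose inputs actually changed."""

    def __init__(self, rules):
        # rules: dict rule_name -> (eval_fn, set_of_input_names)
        self.rules = rules
        self.inputs = {}
        self.results = {}
        self.dependents = defaultdict(set)      # input name -> rules using it
        for name, (_, deps) in rules.items():
            for dep in deps:
                self.dependents[dep].add(name)

    def set_input(self, key, value):
        if self.inputs.get(key) == value:
            return                              # unchanged input: no work at all
        self.inputs[key] = value
        for name in self.dependents[key]:       # only the affected rules re-run
            fn, _ = self.rules[name]
            self.results[name] = fn(self.inputs)

ev = IncrementalEvaluator({
    'hot': (lambda i: i.get('temp', 0) > 30, {'temp'}),
    'wet': (lambda i: i.get('rain', 0) > 0, {'rain'}),
})
ev.set_input('temp', 35)        # re-evaluates 'hot' only, never touches 'wet'
print(ev.results)               # {'hot': True}
```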
I have a programming language that has many constructs in it however I am only interested in extracting expressions from the language.
Is that possible to do without having to write the entire grammar?
Yes, it's possible. You want what is called an "island parser": https://en.wikipedia.org/wiki/Island_grammar. You might not actually decide to do this; more below.
The essential idea is to provide detailed grammar rules for the part of the language ("islands") you care about, and sloppy rules for the rest ("water").
The detailed grammar rules you write as you would normally write them. This includes building a lexer and parser to parse the part you want.
The "water" part is implemented as much as you can by defining sloppy lexemes. You may need more than one, and you will likely have to handle nested structures e.g., things involving "("...")" "["..."] and "{" ... "}" and you will end up doing with explicit tokens for the boundaries of these structures, and recursive grammar rules that keep track of the nesting (because lexers being FSAs typically can't track this).
Not obvious when you start, but painfully obvious once you are deep into this mess, is the need to skip over long comment bodies, and especially string literals with the various quotes allowed by the language (consider Python for an over-the-top set) and the escape sequences inside. You'll get burned by languages that allow interpolated strings when you figure out that you have to lex the raw string content separately from the interpolated expressions, because these are also typically nested structures. PHP and C# allow arbitrary expressions in their interpolated strings... including expressions which themselves can contain... more interpolated strings!
The upside is that none of this is really hard technically, if you ignore the sweat labor of dreaming up and handling all the funny cases.
... but ... considering typical parsing goals, island grammars tend to fall apart when used for this purpose.
To process expressions, you usually need the language's declarations, which provide types for the identifiers. If you leave them in the "water" part, you don't get type declarations, and now it is hard to reason about your expressions. If you are processing Java and you encounter (a+b), is that addition or string concatenation? Without type information you just don't know.
If you decide you need the type information, now you need the detailed grammar for the variable and type declarations. And suddenly you're a lot closer to a full parser. At some point you bail and just build a full parser; then you don't have to think about whether you've cheated properly.
You don't mention your language, but there's a good chance that there's an ANTLR grammar for it here: ANTLR Grammars
These grammars will parse the entire contents of the source. (By doing this, you can avoid some "messiness" that can come with trying to decide when to pop into, and out of, island grammars, which could be particularly messy for expressions since they can occur in so many places within a typical source file.)
Once you have the resulting ParseTree, ANTLR provides a Listener capability that allows you to call a method to walk the tree and call you back for only those parts you are interested in. In your case that would be expressions.
A quick search on ANTLR Listeners should turn up several resources on how to write a Listener for your needs. (One pretty short article covers the basics; in that case for when you're only interested in methods, but expressions would be similar. There are certainly others.)
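As a sketch of the Listener approach using ANTLR's Python target: MyLangLexer, MyLangParser, the `expr` rule and the `compilationUnit` start rule below are all hypothetical names, standing in for whatever `antlr4 -Dlanguage=Python3 MyLang.g4` generates from your grammar:

```python
from antlr4 import CommonTokenStream, InputStream, ParseTreeWalker
from MyLangLexer import MyLangLexer          # hypothetical generated classes
from MyLangParser import MyLangParser
from MyLangListener import MyLangListener

class ExprCollector(MyLangListener):
    def __init__(self):
        self.expressions = []

    def enterExpr(self, ctx):                # fires for every `expr` node
        self.expressions.append(ctx.getText())

source_code = open("example.src").read()     # your input file here
tokens = CommonTokenStream(MyLangLexer(InputStream(source_code)))
tree = MyLangParser(tokens).compilationUnit()   # your start rule here
collector = ExprCollector()
ParseTreeWalker().walk(collector, tree)
print(collector.expressions)                 # every expression in the file
```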
I created my grammar using ANTLR4, but I want to test its robustness. Is there an automatic tool, or a good way to do that quickly?
Thanks :)
As it's so hard to find real unit tests for ANTLR, I wrote 2 articles about it:
Unit test for Lexer
Unit test for Parser
A lexer test checks whether a given text is read and converted to the expected token sequence. It's useful, for instance, to avoid ambiguity errors.
A parser test takes a sequence of tokens (that is, it starts after the lexer part) and checks whether that token sequence traverses the expected rules (Java methods).
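In the same spirit, here is what a lexer test can look like with the ANTLR Python runtime; MyLangLexer and the token names IF/IDENT are assumptions about your grammar (and whitespace is assumed to be skipped with `-> skip`):

```python
import unittest
from antlr4 import InputStream
from MyLangLexer import MyLangLexer   # hypothetical generated lexer

class LexerTest(unittest.TestCase):
    def token_types(self, text):
        # run the lexer alone and collect the type of each emitted token
        return [t.type for t in MyLangLexer(InputStream(text)).getAllTokens()]

    def test_keyword_is_not_an_identifier(self):
        # IF and IDENT are assumed token names from your grammar
        self.assertEqual(self.token_types("if foo"),
                         [MyLangLexer.IF, MyLangLexer.IDENT])

if __name__ == "__main__":
    unittest.main()
```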
The only way I found to create unit tests for a grammar is to create a number of examples from a written spec of the given language. This is neither fast, nor complete, but I see no other way.
You could be tempted to create test cases directly from the grammar (writing a tool for that isn't that hard). But think a moment about this. What would you test then? Your unit tests would always succeed, unless you use generated test cases from an earlier version of the grammar.
A special case is when you write a grammar for a language that has already a grammar for another parser generation tool. In that case you could use the original grammar to generate test cases which you then can use to test your new grammar for conformity.
However, I don't know any tool that can generate the test cases for you.
Update
Meanwhile I got another idea that would allow for better testing: have a sentence generator that generates random sentences from your grammar (I'm currently working on one in my Visual Studio Code ANTLR4 extension). The produced sentences can then be examined using a heuristic approach, for their validity:
Confirm the base structure.
Check for mandatory keywords and their correct order.
Check that identifiers and strings are valid.
Watch out for unusual constructs that are not valid according to the language.
...
This would already cover a good part of the language, but it has limits. Matching code and generating it are not 1:1 operations. A grammar rule that matches certain (valid) input might generate much more than that (and can thus produce invalid input).
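A heuristic checker along the lines of the list above can be very small. A sketch, where the keyword list and identifier pattern are stand-ins of mine for your real language:

```python
import re

REQUIRED_ORDER = ["select", "from"]           # mandatory keywords, in order
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*$")

def looks_valid(sentence):
    """Cheap plausibility check for a generated sentence."""
    words = sentence.split()
    positions = [words.index(k) for k in REQUIRED_ORDER if k in words]
    if len(positions) != len(REQUIRED_ORDER) or positions != sorted(positions):
        return False                          # keyword missing or out of order
    # everything that isn't a keyword must look like an identifier
    return all(IDENT.match(w) for w in words if w not in REQUIRED_ORDER)

print(looks_valid("select foo from bar"))     # True
print(looks_valid("from foo select bar"))     # False: keywords out of order
```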
In one chapter of his book 'Software Testing Techniques', Boris Beizer addresses the topic of 'syntax testing'. The basic idea is to (mentally or actually) take a grammar and represent it as a syntax diagram (aka railroad diagram). For systematic testing, this graph would then be covered: good cases where the input matches the elements, but also bad cases for each node. Iterations and recursive calls would be handled like loops, that is, with cases of zero, one, two, one less than max, max, and one above max iterations (i.e. occurrences of the respective syntactic element).
In a recent question (How to define (and name) the corresponding safe term comparison predicates in ISO Prolog?) #false asked for an implementation of the term ordering predicate lt/2, a variant of the ISO builtin (#<)/2.
The truth value of lt(T1,T2) is to be stable w.r.t. arbitrary variable bindings in T1 and T2.
In various answers, different implementations (based on implicit/explicit term traversal) were proposed. Some caveats and hints were raised in comments, as were counterexamples.
So my question: how can candidate implementations be tested? Some brute-force approach? Or something smarter?
In any case, please share your automatic testing machinery for lt/2! It's for the greater good!
There are two testing strategies: validation and verification.
Validation: Testing is always the same. First you need a specification of what you want to test. Second you need an implementation of what you want to test.
Then from the implementation you extract the code execution paths. And for each code execution path from the specification, you derive the desired outcome.
And then you write test cases combining each execution paths and the desired outcomes. Do not only test positive paths, but also negative paths.
If your code is recursive, you have theoretically infinitely many execution paths.
But you might find that a sub-recursion more or less asks the same as what another test case already asked. So in many cases you can test with a finite set.
Validation gives you confidence.
Verification: You would use some formal methods to derive the correctness of your implementation from the specification.
Verification gives you 100% assurance.
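For the brute-force route asked about in the question, the stability property itself is easy to fuzz. A sketch, modeling Prolog terms in Python purely for illustration (ints as atomic terms, capitalized strings as variables, tuples as compound terms); a real harness would of course live in Prolog, and `naive_lt` is a deliberately broken candidate:

```python
from itertools import product

def substitute(term, binding):
    if isinstance(term, str):
        return binding.get(term, term)          # bind variable if mapped
    if isinstance(term, tuple):
        return (term[0],) + tuple(substitute(a, binding) for a in term[1:])
    return term                                 # atomic term

def variables(term):
    if isinstance(term, str):
        return {term}
    if isinstance(term, tuple):
        vs = set()
        for arg in term[1:]:
            vs |= variables(arg)
        return vs
    return set()

def find_counterexample(lt, t1, t2, universe=(0, 1, ('f', 0))):
    """Return a binding that flips the truth value of lt(t1, t2), if any."""
    base = lt(t1, t2)
    vs = sorted(variables(t1) | variables(t2))
    for values in product(universe, repeat=len(vs)):
        binding = dict(zip(vs, values))
        if lt(substitute(t1, binding), substitute(t2, binding)) != base:
            return binding                      # truth value was not stable
    return None

naive_lt = lambda a, b: str(a) < str(b)         # deliberately unstable candidate
print(find_counterexample(naive_lt, 'X', 0))    # e.g. {'X': ('f', 0)}
```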
Is it possible? Any tool available for this?
You can do this with any system that gives you access to the base grammar. ANTLR and YACC compile your grammar away, so you don't have it anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parse tables if you understood them well enough to do what I describe below.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
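Such a traversal is only a dozen lines over explicitly represented rules. A sketch, with a toy grammar of my own as the stand-in:

```python
import random

# Nonterminals map to lists of alternatives; each alternative is a
# sequence of symbols. ALL-CAPS strings are terminals.
GRAMMAR = {
    "expr": [["term", "PLUS", "expr"], ["term"]],
    "term": [["NUM"], ["LPAREN", "expr", "RPAREN"]],
}

def generate(symbol, depth=0, max_depth=8):
    """Randomly expand `symbol`; force the shortest alternative when deep,
    so the derivation is guaranteed to terminate."""
    rules = GRAMMAR.get(symbol)
    if rules is None:
        return [symbol]                          # terminal: emit as-is
    if depth >= max_depth:
        rules = [min(rules, key=len)]            # shortest expansion only
    return [tok for s in random.choice(rules)
                for tok in generate(s, depth + 1, max_depth)]

print(" ".join(generate("expr")))   # e.g. NUM PLUS LPAREN NUM RPAREN
```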
What it won't do is get you valid code. The problem here is that most languages really have context-sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. The latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules and tree-building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint them). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know whether the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in a way that guarantees at least not-inconsistent usage, but I suspect that would be language-dependent.
yacc and bison turn your grammar into a state machine (an LALR automaton driven by a stack). You should be able to traverse that state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token onto the stack and move to a new state, or reduce the symbols on top of the stack based on a set of valid reductions. (See the Bison manual for details about how this works.)
Your random generator would traverse the state machine, making random but valid shifts or reductions at each state. Once you reach the accepting state, you have a valid input.
For a human readable description of the states you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.