How does yacc generate the syntactic parser from grammar rules?

I've understood how lexical analysis works, but I have no idea how the syntactic analysis is done, though in principle the two should be similar (the only difference lies in the type of their input symbols: characters or tokens), yet the generated parser code is very different.
In particular, there are tables like yy_action and yy_lookahead; there's no such thing in lexical analysis...

The grammars used to generate lexical analyzers are generally regular grammars, while the grammars used to generate syntactic analyzers are generally context-free grammars. Although they might look the same on the surface, they have very different characteristics and capabilities. Regular grammars can be recognized by deterministic finite automata, which are relatively simple to construct and fast to run. Context-free grammars are more challenging to build a recognizer for, and usually a parser generator tool will construct a parser for only a subset of context-free grammars. For example, yacc constructs parsers, using push-down automata, for context-free grammars that are also LALR(1) grammars.
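To make tables like yy_action and yy_lookahead less mysterious: an LALR parser is a loop over a state stack, driven by precomputed action and goto tables. Below is a minimal hand-written C++ sketch for the toy grammar E -> E '+' n | n. This is not yacc's actual output (yacc stores its tables in a compressed form, which is exactly where yy_lookahead comes in); the table contents and names here were worked out by hand for illustration.

    #include <cstdio>
    #include <vector>

    // Toy grammar:
    //   rule 1: E -> E '+' 'n'
    //   rule 2: E -> 'n'
    // Token codes: 0 = 'n', 1 = '+', 2 = end of input.
    enum Act { ERR, SHIFT, REDUCE, ACCEPT };
    struct Action { Act kind; int arg; };       // arg = target state or rule number

    // action[state][token], built by hand from the grammar's LALR(1) states.
    const Action action[5][3] = {
        /* s0 */ {{SHIFT, 1}, {ERR, 0},    {ERR, 0}},
        /* s1 */ {{ERR, 0},   {REDUCE, 2}, {REDUCE, 2}},
        /* s2 */ {{ERR, 0},   {SHIFT, 3},  {ACCEPT, 0}},
        /* s3 */ {{SHIFT, 4}, {ERR, 0},    {ERR, 0}},
        /* s4 */ {{ERR, 0},   {REDUCE, 1}, {REDUCE, 1}},
    };
    const int gotoE[5]  = {2, -1, -1, -1, -1};  // goto table (only nonterminal E)
    const int rhsLen[3] = {0, 3, 1};            // rule 1 has 3 symbols, rule 2 has 1

    bool parse(const std::vector<int>& tokens) {
        std::vector<int> stack{0};              // state stack, start in state 0
        int ip = 0;                             // index of the next input token
        for (;;) {
            Action a = action[stack.back()][tokens[ip]];
            switch (a.kind) {
            case SHIFT:                         // consume token, push new state
                stack.push_back(a.arg); ++ip; break;
            case REDUCE:                        // pop the rule's right-hand side,
                stack.resize(stack.size() - rhsLen[a.arg]);
                stack.push_back(gotoE[stack.back()]);  // then consult goto
                std::printf("reduce by rule %d\n", a.arg);
                break;
            case ACCEPT: return true;
            case ERR:    return false;
            }
        }
    }

    int main() {
        // "n + n <eof>": reduces E -> n, then E -> E + n, then accepts.
        std::printf("%s\n", parse({0, 1, 0, 2}) ? "accepted" : "rejected");
    }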
For more information on parsing, I would highly recommend Parsing Techniques, which walks through all the nuances of parsing in excruciating (but well described!) detail.

Related

Split Grammar and Grammar Attributes (Procedural Generation of Buildings)

I am trying to implement algorithm for procedural generation of buildings for my thesis. I've been reading a lot about shape/split grammars. Here's the link to the most popular paper covering that topic.
I managed to implement a very basic grammar without attributes and guard conditions. It can generate primitives and do geometric transformations. I'm using flex and bison to parse shape grammar files and generate objects (Symbols, Rules etc.) that represent the given grammar in an object-oriented manner, so they can later be invoked to generate geometry.
But now I'm stuck with the attribute part. For example:
fac(h) : h > 9; floor(h/3) floor(h/3) floor(h/3)
I am clueless about how to represent the grammar so that it contains information about attributes, how to pass values to the symbols on the left, and how to evaluate the condition.
Can anyone help me with this, please? I'm using C++.
Note: I have some knowledge about grammars and parsers, and I know how to implement a top-down parser with attributes using recursive descent, but that approach is useless here. I can't generate source code for functions, because the grammar files will be interpreted at runtime, inside the application itself. Even if I could, this is not a parser but a generator of sentences, and there are a lot more problems, like condition evaluation and production selection.
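In case a sketch of one possible direction helps (this is my own illustration, not code from the paper, and every name in it is hypothetical): since the grammar file is interpreted at runtime, attribute expressions and guard conditions can be stored as small evaluatable objects, e.g. closures or expression trees, evaluated against the parameters bound when a rule is applied:

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    using Env  = std::map<std::string, double>;       // parameter bindings, e.g. h = 12
    using Expr = std::function<double(const Env&)>;   // an attribute expression

    struct SymbolUse {                                // a successor such as floor(h/3)
        std::string name;
        std::vector<Expr> args;                       // evaluated in the predecessor's Env
    };
    struct Rule {
        std::string predecessor;                      // "fac"
        std::vector<std::string> params;              // {"h"}
        std::function<bool(const Env&)> guard;        // h > 9
        std::vector<SymbolUse> successors;
    };

    // fac(h) : h > 9 ; floor(h/3) floor(h/3) floor(h/3)
    // Your bison actions would build these closures (or an equivalent
    // expression AST with an eval() method) while reading the grammar file.
    Rule makeFacRule() {
        Expr third = [](const Env& e) { return e.at("h") / 3.0; };
        return Rule{
            "fac", {"h"},
            [](const Env& e) { return e.at("h") > 9.0; },
            {{"floor", {third}}, {"floor", {third}}, {"floor", {third}}}
        };
    }

    int main() {
        Rule r = makeFacRule();
        Env env{{"h", 12.0}};                         // applying the rule to fac(12)
        if (r.guard(env))                             // guard: 12 > 9, so the rule fires
            std::printf("each floor gets h = %g\n", r.successors[0].args[0](env));
    }

Production selection and condition evaluation then happen at generation time, by evaluating each candidate rule's guard in the environment built from its actual parameters.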

Difference between a regular language and a regular grammar

My book gives similar but slightly different explanations of regular grammar and regular language.
I doubt it's wrong; is a regular language the same thing as a regular grammar?
The definition of my book is:
A grammar is regular if all the productions are of the form V -> aW or V -> Wa, with V, W nonterminal symbols and "a" a terminal symbol. W can also be empty or the same as V.
Regular grammars and regular languages are two different terms:
A language is a (possibly infinite) set of valid sequences of terminal symbols.
A grammar defines which are the valid sequences.
The same language can be described by grammars from different classes (regular, context-free, etc.). A language is said to be regular if it can be described by a regular grammar. On the other hand, a regular grammar always defines a regular language. What you have posted is the definition of a regular grammar.
See this Wikipedia post for further information.
A formal grammar is a set of rules, whereas a formal language is a set of strings.
A regular grammar is a formal grammar that describes a regular language.
According to Wikipedia:
[T]he left regular grammars generate exactly all regular languages. The right regular grammars describe the reverses of all such languages, that is, exactly the regular languages as well.
If mixing of left-regular and right-regular rules is allowed, we still have a linear grammar, but not necessarily a regular one.
In the above, left-regular rules are rules of the form V -> Wa, and right-regular rules are of the form V -> aW.
I think if I explain the difference between a language and grammar, your queries will automatically get resolved.
A language is a set of strings over some alphabet, satisfying certain rules encoded as a grammar, while
Grammars are used to generate languages.
So, basically, a grammar encodes the syntactic rules for strings, and the set of strings that can be generated from the start symbol of the grammar is called the language of the grammar.
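A tiny example to make the distinction concrete: the regular grammar with productions S -> aS and S -> b (a finite rule set) generates the infinite language { b, ab, aab, ... }, i.e. a*b. The sketch below, illustrative only, prints the first few members of that language:

    #include <iostream>
    #include <string>

    // The grammar S -> aS | b is the finite object; its language is the
    // infinite set of strings a^n b. Applying S -> aS n times and then
    // S -> b derives the n-th string.
    int main() {
        for (int n = 0; n < 4; ++n)
            std::cout << std::string(n, 'a') << "b\n";   // b, ab, aab, aaab
    }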

chomsky hierarchy in plain english

I'm trying to find a plain (i.e. non-formal) explanation of the 4 levels of formal grammars (unrestricted, context-sensitive, context-free, regular) as set out by Chomsky.
It's been an age since I studied formal grammars, and the various definitions are now confusing for me to visualize. To be clear, I'm not looking for the formal definitions you'll find everywhere (e.g. here and here -- I can google as well as anyone else), or really even formal definitions of any sort. Instead, what I was hoping to find was clean and simple explanations that don't sacrifice clarity for the sake of completeness.
Maybe you will get a better understanding if you remember the automata that recognize these languages.
Regular languages are recognized by finite automata. These have only finite knowledge of the past (their working memory is bounded), so whenever a language has suffixes that depend on arbitrarily long prefixes (e.g. the palindrome language), it cannot be regular.
Context-free languages are recognized by nondeterministic pushdown automata. These have a kind of knowledge of the past (the stack, which, unlike a finite automaton's memory, is unbounded), but a stack can only be viewed from the top, so you don't have complete knowledge of the past.
Context-sensitive languages are recognized by linear-bounded nondeterministic Turing machines. These know the past and can deal with different contexts, because they are nondeterministic and can access the whole tape at any time.
Unrestricted languages are recognized by Turing machines. According to the Church-Turing thesis, Turing machines are able to compute anything that is computable at all.
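To make the first two levels concrete, here is a sketch (my own illustration, not from any textbook) of a pushdown-style recognizer for the center-marked palindrome language { w c reverse(w) : w over {a,b} }, which is context-free but not regular: the stack remembers an unbounded prefix, yet can only be compared from the top, i.e. in reverse order.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Recognizes { w 'c' reverse(w) : w over {a,b} }, a context-free language
    // that no finite automaton can accept: the stack stores the unbounded
    // prefix w, but is only readable from the top.
    bool accepts(const std::string& input) {
        std::vector<char> stack;
        std::size_t i = 0;
        while (i < input.size() && input[i] != 'c')      // phase 1: push w
            stack.push_back(input[i++]);
        if (i == input.size()) return false;             // no center marker
        ++i;                                             // skip 'c'
        while (i < input.size()) {                       // phase 2: pop and compare
            if (stack.empty() || stack.back() != input[i]) return false;
            stack.pop_back();
            ++i;
        }
        return stack.empty();                            // all of w was matched
    }

    int main() {
        std::cout << accepts("abcba") << accepts("abcab") << "\n";  // prints 10
    }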
As for regular languages, there are many equivalent characterizations. They give many different ways of looking at regular languages. It is hard to give a "plain English" definition, and if you find it hard to understand any of the characterizations of regular languages, it is unlikely that a "plain English" explanation will help. One thing to note from the definitions and various closure properties is that regular languages embody the notion of "finiteness" somehow. But this is again hard to appreciate without better familiarity with regular languages.
Do you find the notion of a finite automaton to be not simple and clean?
Let me mention some of the many equivalent characterizations (at least for other readers) :
Languages accepted by deterministic finite automata
Languages accepted by nondeterministic finite automata
Languages accepted by alternating finite automata
Languages accepted by two-way deterministic finite automata
Languages generated by left-linear grammars
Languages generated by right-linear grammars
Languages generated by regular expressions.
A union of some equivalence classes of a right-congruence of finite index.
A union of some equivalence classes of a congruence of finite index.
The inverse image under a monoid homomorphism of a subset of a finite monoid.
Languages expressible in monadic second order logic over words.
Regular: membership in these languages can be decided yes/no by a finite automaton.
Context-free: given an input word, we can always answer yes/no as to whether it is a member of the language, using a state machine plus a stack.
Context-sensitive: as long as no production in the grammar shrinks (for every production α -> β, the right-hand side is at least as long as the left), we can answer yes/no, using a state machine and a chunk of memory linear in the size of the input.
Recursively enumerable: a recognizer can answer yes, but in the "no" case it may go into an infinite loop.
See this video for a full explanation.

Convert ANTLR grammar to Bison / EBNF

Is there a tool for converting an ANTLR grammar to a Bison grammar?
I doubt it. Since ANTLR supports a broader class of grammars than Bison, it's only even possible for a subset of ANTLR grammars. At least from what I've seen, relatively few ANTLR grammars fit in the subset that could be directly converted to Bison.

Is it easier to write a recursive-descent parser using an EBNF or a BNF?

I've got a BNF and an EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea of how to use the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources for converting an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't go over converting them into a recursive-descent parser. We did go over converting BNFs into a recursive-descent parser.
The reason I'm asking is because the EBNF is more compact.
From looking at EBNFs in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive-descent parsers. How they do it is exactly the answer to your question.
(It's pretty straightforward, but it is good to understand the details.)
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too, and the output of both! There's an astonishing moment you come to if you go and build one of these, when you realize how the metacompiler compiles itself using its own grammar. This moment got me hooked on compilers back in about 1970, when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system based on these metacompilers) has a great online MetaII tutorial, for those of you who don't want the do-it-from-scratch experience. (I have nothing to do with this except that Neighbors and I were undergraduates together.)
Both ways are a fine way to learn about metacompilers and generating parsers from EBNF.
The key idea is that the left-hand side of a rule creates a function that parses that nonterminal: it returns true and advances the input stream if there is a match, and returns false without advancing the input stream if there is not.
The contents of the function are determined by the right-hand side. Literal tokens are matched directly.
Nonterminals cause calls to the functions generated for the other rules.
Kleene star maps to while loops, and alternations map to conditional branches. What EBNF doesn't address, and the metacompilers do, is how parsing does anything other than saying "matched" or not.
The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use the English... "zero or more" is a simple while loop as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
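As a concrete illustration of both answers (my own sketch, not from the MetaII paper), here is the EBNF rule expr = term { "+" term } hand-translated into C++, with the braces becoming a while loop and term simplified to a single digit:

    #include <cctype>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    static std::string in;                // the input being parsed
    static std::size_t pos = 0;           // current position in the input

    bool match(char c) {                  // consume c if it is next, else fail
        if (pos < in.size() && in[pos] == c) { ++pos; return true; }
        return false;
    }
    bool parseTerm() {                    // term = digit ;  (simplified)
        if (pos < in.size() && std::isdigit((unsigned char)in[pos])) { ++pos; return true; }
        return false;
    }
    bool parseExpr() {                    // expr = term { "+" term } ;
        if (!parseTerm()) return false;   // the mandatory first term
        while (match('+'))                // the EBNF braces, done iteratively
            if (!parseTerm()) return false;
        return true;                      // zero repetitions are fine
    }

    int main() {
        in = "1+2+3";
        std::printf("%s\n", parseExpr() && pos == in.size() ? "matched" : "no match");
    }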
The early metacompilers META II and TREEMETA and their kin are not exactly recursive-descent parsers. They were stated as using recursive functions; that just meant the functions could call themselves.
We do not call C a recursive language. A C or C++ function is recursive in the same way the early metacompilers are recursive.
Recursion can be used; they were programming languages. Recursion is generally used only when analyzing nested language constructs, for example parenthesized expressions and nested blocks.
They are more of an LR / recursive-descent combination. CWIC, the last documented one, has extensive backtracking and lookahead features. The '-' (not) operator can be applied to any language construct and inverts its success or failure: -term fails if a term is matched, for example, and the input is never advanced. The '?' operator looks ahead at any language construct: ?expr, for example, would try to parse an expr. The construct matched by the lookahead '?' is not kept, nor is the input advanced.
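Here is a sketch (mine, not CWIC's actual implementation) of how such non-consuming operators can be realized in the recursive-descent setting described above: run the sub-parse on a saved input position and restore it afterwards, so the input never advances either way.

    #include <cctype>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    struct Input { std::string s; std::size_t pos; };

    bool parseTerm(Input& in) {           // stand-in "term": a single letter
        if (in.pos < in.s.size() && std::isalpha((unsigned char)in.s[in.pos])) {
            ++in.pos;
            return true;
        }
        return false;
    }

    bool lookahead(Input& in, bool (*p)(Input&)) {   // '?p': peek without consuming
        Input saved = in;
        bool ok = p(in);
        in = saved;                                  // the input never advances
        return ok;
    }
    bool notOp(Input& in, bool (*p)(Input&)) {       // '-p': invert success/failure
        return !lookahead(in, p);
    }

    int main() {
        Input a{"x", 0}, b{"1", 0};
        std::printf("%d %d\n", notOp(a, parseTerm), notOp(b, parseTerm));  // 0 1
    }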