lex and yacc (symbol table generation) - yacc

I am new to lex and yacc and compiler design.
I would like to know at which phase(lexical, syntactical or any other phase) and how the symbol table is generated?
Can I have a brief description of the y.output file which is generated by giving the -v option to yacc? I tried looking into it but didn't get much information.
Could I also know about other applications where lex and yacc are used, apart from compiler design?

A symbol table is a global data structure that can be used in all stages/phases/passes of a compiler. This means that it can be used/accessed from both the lex and yacc generated components.
It is conventional to access the symbol table from the lexical analyser: when it finds a token that belongs in the table, such as an identifier, it can locate (or create) the entry and update it with information only available to the lexer, like line number and character position, and it can also store the lexeme value if it is not already there. A pointer to the symbol table entry can then be returned in the lval of the token.
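As a rough sketch of that first approach (the IDENTIFIER token, the sym member of yylval and the symtab_lookup_or_insert helper are all invented names here, and yylineno assumes a flex scanner built with %option yylineno), the rules section of the lex specification might look like:

[A-Za-z_][A-Za-z0-9_]*  {
                            /* find the entry, or create it on first sight */
                            struct symbol *s = symtab_lookup_or_insert(yytext);
                            s->line = yylineno;   /* information only the lexer has */
                            yylval.sym = s;       /* hand the entry to the parser */
                            return IDENTIFIER;
                        }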
Some people prefer to return a pointer to the lexeme itself (as the lval) from the lexer to the parser and do the initial symbol table access there. This has the advantage that the symbol table does not have to be visible to the lexer, but the disadvantage that lexer information as described above may no longer be available to store with the symbol. It also tends to make the parser actions from yacc a little more "busy", as they may then be managing the symbol table as well as the parse tree.
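A sketch of that alternative (again with invented names), where the lexer simply returns a strdup'd copy of the lexeme and the yacc action does the table access:

declaration
    : type_name IDENTIFIER
        {
            /* $2 is assumed to be a char * copy of the lexeme made in the lexer */
            struct symbol *s = symtab_lookup_or_insert($2);
            s->type = $1;     /* semantic information the parser knows about */
            free($2);
        }
    ;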
The symbol table entry will be further updated in later phases of the compiler, such as a semantic walk of the parse tree, which can annotate the symbol entries with type information and flag undeclared objects and the like. The symbol table will be used again during target code generation, when target-specific information may be stored or needed, and again during optimisation, when variable usage may be examined or variables even optimised away.
The symbol table is a data structure that you the compiler writer create for yourself. There is no feature of lex or yacc that does it for you. It is generated as and when any code you write creates it!
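For illustration only, here is one possible shape for such a hand-written table in C, including the symtab_lookup_or_insert helper used in the sketches above (a real compiler would normally use a hash table rather than a linked list):

#include <stdlib.h>
#include <string.h>

struct symbol {
    char          *name;   /* the lexeme */
    int            line;   /* filled in by the lexer */
    int            type;   /* filled in by later semantic phases */
    struct symbol *next;
};

static struct symbol *symtab;   /* head of the list */

struct symbol *symtab_lookup_or_insert(const char *name)
{
    struct symbol *s;
    for (s = symtab; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    s = calloc(1, sizeof *s);
    s->name = strdup(name);
    s->next = symtab;
    symtab = s;
    return s;
}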
The y.output file has nothing to do with symbol tables. It is a record of how yacc converted the context free grammar into a parse table. It is useful when you have an ambiguous grammar and want to know which rules are causing the shift/reduce or reduce/reduce conflicts when debugging your grammar.
The last part of the question: what uses do these tools have? lex is a tool that generates code for a state machine that recognises the patterns you specify. It does not have to be used for writing compilers. One interesting use is in handling networking protocols that can be processed by a state machine, such as TCP/IP datagrams and so forth. Similarly, yacc is used for matching sequences that are described by context free grammars. These do not have to be programs, but could be other complex sequences of symbols, fields or data items. They are just normally pieces of text, and that is the orthodox use of the tool.
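As a toy non-compiler example (entirely made up, and assuming flex), a small specification can pull IPv4-looking addresses out of arbitrary input such as a log file:

%{
#include <stdio.h>
%}
%option noyywrap
%%
[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+   { printf("address: %s\n", yytext); }
.|\n                             { /* discard everything else */ }
%%
int main(void)
{
    return yylex();   /* scan stdin until end of file */
}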
These parts of your question really sound like the kind of exam question that someone might write for students who have attended a course in compilers!

Related

Reorder token stream in XText lexer

I am trying to lex/parse an existing language by using XText. I have already decided I will use a custom ANTLRv3 lexer, as the language cannot be lexed in a context-free way. Fortunately, I do not need parser information; the previously encountered tokens are enough to decide the lexing mode.
The target language has an InputSection that can be described as follows: InputSection: INPUT_SECTION A=ID B=ID;. However, it can be specified in two different ways.
; The canonical way
$InputSection Foo Bar
$SomeOtherSection Fonzie
; The haphazard way
$InputSection Foo
$SomeOtherSection Fonzie
$InputSection Bar
Could I use TokenStreamRewriter to reorder all tokens in the canonical way, before passing this on to the parser? Or will this generate issues in XText later?
After a lot of investigation, I have come to the conclusion that editor tools themselves are truly not fit for this type of problem.
If you started typing in one section, you would have to take into account the AST context from subsequent sections to know how to auto-complete. At the same time, this would be very confusing for the user.
In the end, I will therefore simply not support this obscure feature of the language. Instead, the AST will be constructed so that a section (reasonably) divided between two parts will still parse correctly.

Is there a way to generate unit test to test my grammar

I created my grammar using antlr4 but I want to test its robustness.
Is there an automatic tool or a good way to do that fast?
Thanks :)
As it's so hard to find real unit tests for ANTLR, I wrote 2 articles about it:
Unit test for Lexer
Unit test for Parser
A Lexer test checks whether a given text is read and converted to the expected Token sequences. It's useful for instance to avoid ambiguity errors.
A Parser test takes a sequence of tokens (that is, it starts after the lexer part) and checks whether that token sequence traverses the expected rules (Java methods).
The only way I found to create unit tests for a grammar is to create a number of examples from a written spec of the given language. This is neither fast, nor complete, but I see no other way.
You might be tempted to create test cases directly from the grammar (writing a tool for that isn't hard). But think a moment about this: what would you test then? Your unit tests would always succeed, unless you used generated test cases from an earlier version of the grammar.
A special case is when you write a grammar for a language that already has a grammar for another parser generation tool. In that case you could use the original grammar to generate test cases, which you can then use to test your new grammar for conformity.
However, I don't know any tool that can generate the test cases for you.
Update
Meanwhile I got another idea that would allow for better testing: have a sentence generator that generates random sentences from your grammar (I'm currently working on one in my Visual Studio Code ANTLR4 extension). The produced sentences can then be examined for validity using a heuristic approach:
Confirm the base structure.
Check for mandatory keywords and their correct order.
Check that identifiers and strings are valid.
Watch out for unusual constructs that are not valid according to the language.
...
This would already cover a good part of the language, but has limits. Matching code and generating it are not 1:1 operations. A grammar rule that matches certain (valid) input might generate much more than that (and can thus produce invalid input).
In one chapter of his book 'Software Testing Techniques' Boris Beizer addresses the topic of 'syntax testing'. The basic idea is to (mentally or actually) take a grammar and represent it as a syntax diagram (aka railroad diagram). For systematic testing, this graph would then be covered: good cases where the input matches the elements, but also bad cases for each node. Iterations and recursive calls would be handled like loops, that is, cases with zero, one, two, one less than max, max, and one above max iterations (i.e. occurrences of the respective syntactic element).

ANTLR: Source to Target Language Conversion

I have a fair understanding of ANTLR & grammars. Is it correct to say ANTLR can do source language to target language conversion, like ASP to JSP or COBOL to JSP? If yes, could you provide some information/tutorials/links to explore the possibilities?
The idea is to programmatically translate huge amounts of code from source to target using ANTLR.
Thanks
The basic steps to building a translator in Antlr4 are to:
generate a parse tree from an input text in the source language
repeatedly walk the parse tree to analyze the nodes of the parse tree, adding and evolving properties (decorator pattern) associated with individual parse tree nodes -- the properties will describe the change(s) required to represent the content of the node in the target language.
finally, walk the parse tree once more to collect and output the target language text.
The form and content of the properties and the progression of creation and evolution will be entirely dependent on the nature of the source and target languages and the architect's conversion strategy.
Since Antlr parse-tree walks can be logically independent of one another, specific conversion aspects can be addressed in separate walks. For example, one walk can evaluate (possibly among other things) whether individual perform until statements will be converted to if or while statements. Another walk can be dedicated to analyzing variable names to ensure they are created/accessed in the correct scope and determining the naming and scope of any target language required temporary variables. Etc.
Given that the conversion is a one-time affair, there is no fundamental penalty to implementing 5, 10, or even more walks. Just the 'whatever makes sense in your case' practicality.
The (relevant) caveat addressed in the other QA is how to handle conversions where there is no simple or near identity between statements in the two languages. To convert a unique source language statement then requires a target language run-time package be created to implement the corresponding function.
GenPackage (I am the author) automates the generation of a basic conversion project. The generated project represents but one possible architectural approach and leaves substantial work to be done to tailor it to any particular end use.

Generating random but still valid expressions based on yacc/bison/ANTLR grammar

Is it possible? Any tool available for this?
You can do this with any system that gives you access to the base grammar. ANTLR and YACC compile your grammar away, so you don't have it anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parse tables if you understood them well enough to do what I describe below.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
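As a minimal sketch of that idea (the toy grammar and all code here are invented for illustration, and the grammar is hard-coded into recursive functions rather than traversed as an explicit rule set), each call randomly picks one of its alternatives, with a depth limit so the derivation terminates:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Toy grammar:
     expr   : term '+' expr | term
     term   : factor '*' term | factor
     factor : '(' expr ')' | NUMBER          */

static void expr(int depth);

static void factor(int depth)
{
    if (depth < 4 && rand() % 3 == 0) {
        printf("( ");
        expr(depth + 1);
        printf(") ");
    } else {
        printf("%d ", rand() % 100);   /* a random NUMBER token */
    }
}

static void term(int depth)
{
    factor(depth);
    if (depth < 4 && rand() % 2 == 0) {
        printf("* ");
        term(depth + 1);
    }
}

static void expr(int depth)
{
    term(depth);
    if (depth < 4 && rand() % 2 == 0) {
        printf("+ ");
        expr(depth + 1);
    }
}

int main(void)
{
    srand((unsigned)time(NULL));
    expr(0);                /* print one random, syntactically valid expression */
    printf("\n");
    return 0;
}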
What it won't do is get you valid code. The problem here is that most languages really have context sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. The latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules and tree building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint them). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know whether the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in a way that guarantees at least not-inconsistent usage, but I suspect that would be language-dependent.
yacc and bison turn your grammar into a table-driven state machine (used together with a stack). You should be able to traverse the state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token onto the stack and move to a new state, or reduce the symbols on top of the stack according to one of the valid reductions. (See the Bison manual for details about how this works.)
Your random generator will traverse the state machine making random but valid shifts or reductions at each state. Once you reach the terminal state you have a valid input.
For a human readable description of the states you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.

removing dead variables using antlr

I am currently servicing an old VBA (visual basic for applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them I do not want to do this by hand.
Therefore I had the idea of transforming the separate code files which contain the dead variables (according to the aforementioned tool) into ASTs and removing the variables that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules, and if I had a comment on the hidden channel, it would be lost, right?
Everything I need is to remove parts of that code and print out the rest as it was read in.
Does anyone have any recommendations, please?
Some theory
I assume that regular expressions are not enough to solve your task; that is, you can't capture the notion of a dead-code section in a regular language, and you need the context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
Tokenize source code with a lexer.
Since you want to preserve all the correct code, don't skip or hide its tokens. Make sure to define separate tokens for the parts which may be removed or which will be used to determine the dead code; all other characters can be collected under a single token type. Here you can use the output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language, so you can't remove all the dead code in this step.
Construct AST with a parser.
Here all the power of a context-free language can be applied: define dead-code sections in the parser's rules and omit them from the AST being constructed.
Convert the AST back to source code. You can use some tree parser here, but I guess there is an easier way, which can be found by looking at toString and similar methods of the tree type returned by the parser.