What are the steps I need to do to complete this programming assignment? - yacc

I'm having a hard time understanding what I'm supposed to do. The only thing I've figured out is I need to use yacc on the cminus.y file. I'm totally confused about everything after that. Can someone explain this to me differently so that I can understand what I need to do?
INTRODUCTION:
We will use lex/flex and yacc/Bison to generate an LALR parser. I will give you a file called cminus.y. This is a yacc format grammar file for a simple C-like language called C-minus, from the book Compiler Construction by Kenneth C. Louden. I think the grammar should be fairly obvious.
The Yahoo group has links to several descriptions of how to use yacc. Now that you know flex it should be fairly easy to learn yacc. The only base type is int. An int is 4 bytes. Booleans are handled as ints, as in C. (Actually the grammar allows you to declare a variable as a type void, but let's not do that.) You can have one-dimensional arrays.
There are no pointers, but references to array elements should be treated as pointers (as in C).
The language provides for assignment, IF-ELSE, WHILE, and function calls and returns.
We want our compiler to output MIPS assembly code, and then we will be able to run it on SPIM. For a simple compiler like this with no optimization, an IR should not be necessary. We can output assembly code directly in one pass. However, our first step is to generate a symbol table.
SYMBOL TABLE:
I like Dr. Barrett’s approach here, which uses a lot of pointers to handle objects of different types. In essence the elements of the symbol table are identifier, type and pointer to an attribute object. The structure of the attribute object will differ according to the type. We only have a small number of types to deal with. I suggest using a linear search to find symbols in the table, at least to start. You can change it to hashing later if you want better performance. (If you want to keep in C, you can do dynamic allocation of objects using malloc.)
First you need to make a list of all the different types of symbols that there are—there are not many—and what attributes would be necessary for each. Be sure to allow for new attributes to be added, because we
have not covered all the issues yet. Looking at the grammar, the question of parameter lists for functions is a place where some thought needs to be put into the design. I suggest more symbol table entries and pointers.
TESTING:
The grammar is correct, so taking the existing grammar as it is and generating a parser, the parser will accept a correct C-minus program but it won’t produce any output, because there are no code snippets associated with the rules.
We want to add code snippets to build the symbol table and print information as it does so.
When an identifier is declared, you should print the information being entered into the symbol table. If a previous declaration of the same symbol in the same scope is found, an error message should be printed.
When an identifier is referenced, you should look it up in the table to make sure it is there. An error message should be printed if it has not been declared in the current scope.
When closing a scope, warnings should be generated for unreferenced identifiers.
Your test input should be a correctly formed C-minus program, but at this point nothing much will happen on most of the production rules.
SCOPING:
The most basic approach has a global scope and a scope for each function declared.
The language allows declarations within any compound statement, i.e. scope nesting. Implementing this will require some kind of scope numbering or stacking scheme. (Stacking works best for a one-pass
compiler, which is what we are building.)

(disclaimer) I don't have much experience with compiler classes (as in school courses on compilers) but here's what I understand:
1) You need to use the tools mentioned to create a parser which, when given input will tell the user if the input is a correct program as to the grammar defined in cminus.y. I've never used yacc/bison so I don't know how it is done, but this is what seems to be done:
(input) file-of-some-sort which represents output to be parsed
(output) reply-of-some-sort which tells if the (input) is correct with respect to the provided grammar.
2) It also seems that the output needs to check for variable consistency (ie, you can't use a variable you haven't declared same as any programming language), which is done via a symbol table. In short, every time something is declared you add it to the symbol table. When you encounter an identifier, if it is not one of the language identifiers (like if or while or for), you'll look it up in the symbol table to determine if it has been declared. If it is there, go on. If it's not - print some-sort-of-error
Note: point(2) there is a simplified take on a symbol table; in reality there's more to them than I just wrote but that should get you started.
I'd start with yacc examples - see what yacc can do and how it does it. I guess there must be some big example-complete-with-symbol-table out there which you can read to understand further.
Example:
Let's take input A:
int main()
{
int a;
a = 5;
return 0;
}
And input B:
int main()
{
int a;
b = 5;
return 0;
}
and assume we're using C syntax for parsing. Your parser should deem Input A all right, but should yell "b is undeclared" for Input B.

Related

How does `Yacc` identifies function calls?

I am trying to figure out how yacc identifies function calls in a C code. For Example: if there is a function call like my_fun(a,b); then which rules does this statement reduces to.
I am using the cGrammar present in : C Grammar
Following the Grammar given over there manually; I figured out that we only have two choices in translation unit. Everything has to either be a function definition or a declaration. Now all declaration starts type_specifiers, storage_class_specifier etc but none of them starts with IDENTIFIER
Now in case of a function call the name would be IDENTIFIER. This leaves me unclear as to how it will be parsed and which rules will be used exactly?
According to the official yacc specification specified here yacc, everything is handled by user given routines. When you have a function call the name of course is IDENTIFIER.It is parsed using the user defined procedures.According to the specifications, the user can specify his input in terms of individual input characters, or in terms of higher level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.
Do have a look.By the way you are supposed to do a thorough research before putting questions here.

Ocaml naming convention

I am wondering if there exists already some naming conventions for Ocaml, especially for names of constructors, names of variables, names of functions, and names for labels of record.
For instance, if I want to define a type condition, do you suggest to annote its constructors explicitly (for example Condition_None) so as to know directly it is a constructor of condition?
Also how would you name a variable of this type? c or a_condition? I always hesitate to use a, an or the.
To declare a function, is it necessary to give it a name which allows to infer the types of arguments from its name, for example remove_condition_from_list: condition -> condition list -> condition list?
In addition, I use record a lot in my programs. How do you name a record so that it looks different from a normal variable?
There are really thousands of ways to name something, I would like to find a conventional one with a good taste, stick to it, so that I do not need to think before naming. This is an open discussion, any suggestion will be welcome. Thank you!
You may be interested in the Caml programming guidelines. They cover variable naming, but do not answer your precise questions.
Regarding constructor namespacing : in theory, you should be able to use modules as namespaces rather than adding prefixes to your constructor names. You could have, say, a Constructor module and use Constructor.None to avoid confusion with the standard None constructor of the option type. You could then use open or the local open syntax of ocaml 3.12, or use module aliasing module C = Constructor then C.None when useful, to avoid long names.
In practice, people still tend to use a short prefix, such as the first letter of the type name capitalized, CNone, to avoid any confusion when you manipulate two modules with the same constructor names; this often happen, for example, when you are writing a compiler and have several passes manipulating different AST types with similar types: after-parsing Let form, after-typing Let form, etc.
Regarding your second question, I would favor concision. Inference mean the type information can most of the time stay implicit, you don't need to enforce explicit annotation in your naming conventions. It will often be obvious from the context -- or unimportant -- what types are manipulated, eg. remove cond (l1 # l2). It's even less useful if your remove value is defined inside a Condition submodule.
Edit: record labels have the same scoping behavior than sum type constructors. If you have defined a {x: int; y : int} record in a Coord submodule, you access fields with foo.Coord.x outside the module, or with an alias foo.C.x, or Coord.(foo.x) using the "local open" feature of 3.12. That's basically the same thing as sum constructors.
Before 3.12, you had to write that module on each field of a record, eg. {Coord.x = 2; Coord.y = 3}. Since 3.12 you can just qualify the first field: {Coord.x = 2; y = 3}. This also works in pattern position.
If you want naming convention suggestions, look at the standard library. Beyond that you'll find many people with their own naming conventions, and it's up to you to decide who to trust (just be consistent, i.e. pick one, not many). The standard library is the only thing that's shared by all Ocaml programmers.
Often you would define a single type, or a single bunch of closely related types, in a module. So rather than having a type called condition, you'd have a module called Condition with a type t. (You should give your module some other name though, because there is already a module called Condition in the standard library!). A function to remove a condition from a list would be Condition.remove_from_list or ConditionList.remove. See for example the modules List, Array, Hashtbl,Map.Make`, etc. in the standard library.
For an example of a module that defines many types, look at Unix. This is a bit of a special case because the names are mostly taken from the preexisting C API. Many constructors have a short prefix, e.g. O_ for open_flag, SEEK_ for seek_command, etc.; this is a reasonable convention.
There's no reason to encode the type of a variable in its name. The compiler won't use the name to deduce the type. If the type of a variable isn't clear to a casual reader from the context, put a type annotation when you define it; that way the information provided to the reader is validated by the compiler.

Most appropriate data structure for dynamic languages field access

I'm implementing a dynamic language that will compile to C#, and it's implementing its own reflection API (.NET's is too slow, and the DLR is limited only to more recent and resourceful implementations).
For this, I've implemented a simple .GetField(string f) and .SetField(string f, object val) interface. Until recently, the implementation just switches over all possible field string values and makes the corresponding action.
Also, this dynamic language has the possibility to define anonymous objects. For those anonymous objects, at first, I had implemented a simple hash algorithm.
By now, I am looking for ways to optimize the dynamic parts of the language, and I have come across the fact that a hash algorithm for anonymous objects would be overkill. This is because the objects are usually small. I'd say the objects contain 2 or 3 fields, normally. Very rarely, they would contain more than 15 fields. It would take more time to actually hash the string and perform the lookup than if I would test for equality between them all. (This is not tested, just theoretical).
The first thing I did was to -- at compile-time -- create a red-black tree for each anonymous object declaration and have it laid onto an array so that the object can look for it in a very optimized way.
I am still divided, though, if that's the best way to do this. I could go for a perfect hashing function. Even more radically, I'm thinking about dropping the need for strings and actually work with a struct of 2 longs.
Those two longs will be encoded to support 10 chars (A-za-z0-9_) each, which is mostly a good prediction of the size of the fields. For fields larger than this, a special function (slower) receiving a string will also be provided.
The result will be that strings will be inlined (not references), and their comparisons will be as cheap as a long comparison.
Anyway, it's a little hard to find good information about this kind of optimization, since this is normally thought on a vm-level, not a static language compilation implementation.
Does anyone have any thoughts or tips about the best data structure to handle dynamic calls?
Edit:
For now, I'm really going with the string as long representation and a linear binary tree lookup.
I don't know if this is helpful, but I'll chuck it out in case;
If this is compiling to C#, do you know the complete list of fields at compile time? So as an idea, if your code reads
// dynamic
myObject.foo = "some value";
myObject.bar = 32;
then during the parse, your symbol table can build an int for each field name;
// parsing code
symbols[0] == "foo"
symbols[1] == "bar"
then generate code using arrays or lists;
// generated c#
runtimeObject[0] = "some value"; // assign myobject.foo
runtimeObject[1] = 32; // assign myobject.bar
and build up reflection as a separate array;
runtimeObject.FieldNames[0] == "foo"; // Dictionary<int, string>
runtimeObject.FieldIds["foo"] === 0; // Dictionary<string, int>
As I say, thrown out in the hope it'll be useful. No idea if it will!
Since you are likely to be using the same field and method names repeatedly, something like string interning would work well to quickly generate keys for your hash tables. It would also make string equality comparisons constant-time.
For such a small data set (expected upper bounds of 15) I think almost any hashing will be more expensive then a tree or even a list lookup, but that is really dependent on your hashing algorithm.
If you want to use a dictionary/hash then you'll need to make sure the objects you use for the key return a hash code quickly (perhaps a single constant hash code that's built once). If you can prevent collisions inside of an object (sounds pretty doable) then you'll gain the speed and scalability (well for any realistic object/class size) of a hash table.
Something that comes to mind is Ruby's symbols and message passing. I believe Ruby's symbols act as a constant to just a memory reference. So comparison is constant, they are very lite, and you can use symbols like variables (I'm a little hazy on this and don't have a Ruby interpreter on this machine). Ruby's method "calling" really turns into message passing. Something like: obj.func(arg) turns into obj.send(:func, arg) (":func" is the symbol). I would imagine that symbol makes looking up the message handler (as I'll call it) inside the object pretty efficient since it's hash code most likely doesn't need to be calculated like most objects.
Perhaps something similar could be done in .NET.

Write a compiler for a language that looks ahead and multiple files?

In my language I can use a class variable in my method when the definition appears below the method. It can also call methods below my method and etc. There are no 'headers'. Take this C# example.
class A
{
public void callMethods() { print(); B b; b.notYetSeen();
public void print() { Console.Write("v = {0}", v); }
int v=9;
}
class B
{
public void notYetSeen() { Console.Write("notYetSeen()\n"); }
}
How should I compile that? what i was thinking is:
pass1: convert everything to an AST
pass2: go through all classes and build a list of define classes/variable/etc
pass3: go through code and check if there's any errors such as undefined variable, wrong use etc and create my output
But it seems like for this to work I have to do pass 1 and 2 for ALL files before doing pass3. Also it feels like a lot of work to do until I find a syntax error (other than the obvious that can be done at parse time such as forgetting to close a brace or writing 0xLETTERS instead of a hex value). My gut says there is some other way.
Note: I am using bison/flex to generate my compiler.
My understanding of languages that handle forward references is that they typically just use the first pass to build a list of valid names. Something along the lines of just putting an entry in a table (without filling out the definition) so you have something to point to later when you do your real pass to generate the definitions.
If you try to actually build full definitions as you go, you would end up having to rescan repatedly, each time saving any references to undefined things until the next pass. Even that would fail if there are circular references.
I would go through on pass one and collect all of your class/method/field names and types, ignoring the method bodies. Then in pass two check the method bodies only.
I don't know that there can be any other way than traversing all the files in the source.
I think that you can get it down to two passes - on the first pass, build the AST and whenever you find a variable name, add it to a list that contains that blocks' symbols (it would probably be useful to add that list to the corresponding scope in the tree). Step two is to linearly traverse the tree and make sure that each symbol used references a symbol in that scope or a scope above it.
My description is oversimplified but the basic answer is -- lookahead requires at least two passes.
The usual approach is to save B as "unknown". It's probably some kind of type (because of the place where you encountered it). So you can just reserve the memory (a pointer) for it even though you have no idea what it really is.
For the method call, you can't do much. In a dynamic language, you'd just save the name of the method somewhere and check whether it exists at runtime. In a static language, you can save it in under "unknown methods" somewhere in your compiler along with the unknown type B. Since method calls eventually translate to a memory address, you can again reserve the memory.
Then, when you encounter B and the method, you can clear up your unknowns. Since you know a bit about them, you can say whether they behave like they should or if the first usage is now a syntax error.
So you don't have to read all files twice but it surely makes things more simple.
Alternatively, you can generate these header files as you encounter the sources and save them somewhere where you can find them again. This way, you can speed up the compilation (since you won't have to consider unchanged files in the next compilation run).
Lastly, if you write a new language, you shouldn't use bison and flex anymore. There are much better tools by now. ANTLR, for example, can produce a parser that can recover after an error, so you can still parse the whole file. Or check this Wikipedia article for more options.

Writing a TemplateLanguage/VewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.