Write a compiler for a language that looks ahead and multiple files? - optimization

In my language I can use a class variable in my method when the definition appears below the method. It can also call methods below my method and etc. There are no 'headers'. Take this C# example.
class A
{
public void callMethods() { print(); B b; b.notYetSeen();
public void print() { Console.Write("v = {0}", v); }
int v=9;
}
class B
{
public void notYetSeen() { Console.Write("notYetSeen()\n"); }
}
How should I compile that? what i was thinking is:
pass1: convert everything to an AST
pass2: go through all classes and build a list of define classes/variable/etc
pass3: go through code and check if there's any errors such as undefined variable, wrong use etc and create my output
But it seems like for this to work I have to do pass 1 and 2 for ALL files before doing pass3. Also it feels like a lot of work to do until I find a syntax error (other than the obvious that can be done at parse time such as forgetting to close a brace or writing 0xLETTERS instead of a hex value). My gut says there is some other way.
Note: I am using bison/flex to generate my compiler.

My understanding of languages that handle forward references is that they typically just use the first pass to build a list of valid names. Something along the lines of just putting an entry in a table (without filling out the definition) so you have something to point to later when you do your real pass to generate the definitions.
If you try to actually build full definitions as you go, you would end up having to rescan repatedly, each time saving any references to undefined things until the next pass. Even that would fail if there are circular references.

I would go through on pass one and collect all of your class/method/field names and types, ignoring the method bodies. Then in pass two check the method bodies only.

I don't know that there can be any other way than traversing all the files in the source.
I think that you can get it down to two passes - on the first pass, build the AST and whenever you find a variable name, add it to a list that contains that blocks' symbols (it would probably be useful to add that list to the corresponding scope in the tree). Step two is to linearly traverse the tree and make sure that each symbol used references a symbol in that scope or a scope above it.
My description is oversimplified but the basic answer is -- lookahead requires at least two passes.

The usual approach is to save B as "unknown". It's probably some kind of type (because of the place where you encountered it). So you can just reserve the memory (a pointer) for it even though you have no idea what it really is.
For the method call, you can't do much. In a dynamic language, you'd just save the name of the method somewhere and check whether it exists at runtime. In a static language, you can save it in under "unknown methods" somewhere in your compiler along with the unknown type B. Since method calls eventually translate to a memory address, you can again reserve the memory.
Then, when you encounter B and the method, you can clear up your unknowns. Since you know a bit about them, you can say whether they behave like they should or if the first usage is now a syntax error.
So you don't have to read all files twice but it surely makes things more simple.
Alternatively, you can generate these header files as you encounter the sources and save them somewhere where you can find them again. This way, you can speed up the compilation (since you won't have to consider unchanged files in the next compilation run).
Lastly, if you write a new language, you shouldn't use bison and flex anymore. There are much better tools by now. ANTLR, for example, can produce a parser that can recover after an error, so you can still parse the whole file. Or check this Wikipedia article for more options.

Related

How can I make the origin of aliases in Cypress more apparent from the it/spec files?

My team is using aliases to set some important variables which are used within the it('Test',) blocks.
For example, we may be running the following command in a before step:
cy.setupSomeDynamicData()
Then the setupSomeDynamicData() method exists in a another file (ex: commands.js) and the setupSomeDynamicData() method may setup a couple aliases:
cypress/support/commands.js
setupSomeDynamicData() {
cy.createDynamicString(first).as('String1')
cy.createDynamicString(second).as('String2')
cy.createDynamicString(third).as('String3')
}
Now we go back to our spec/test file, and start using these aliases:
cypress/e2e/smallTest.cy.js
it('A small example test', function () {
cy.visit(this.String1)
// do some stuff...
cy.get(this.String2)
// do some stuff...
cy.visit(this.String3)
// do some stuff...
})
The problem is that unless you're the person who wrote the code, it's not obvious where this.String1, this.String2, or this.String3 are coming from nor when they were initialized (from the perspective of smallTest.cy.js) since the code that initializes the aliases is being executed in another file.
In the example, it's quite easy to Ctrl+F the codebase and search for these aliases but you have to really start doing some reverse engineering once you have more complex use cases.
I guess this feels like some sort of readability/maintainability problem because once you setup enough of these and the example I provided starts to get more complex then finding out where these aliases are created can be inconvenient. The this.* syntax makes it feel like you'd find these aliases variables somewhere within the same file in which they're being used but when you don't see any sign of them then it becomes evident that they've just magically been initialized (somewhere/somehow) and then the hunt 🕵🏼‍♂️ begins.
Some solutions that come to mind (which may be bad ideas) are:
Create JS objects with getters/setters. This way, it'll be a bit easier to trace where the variable you're using was "set"
Not use aliases, and instead, use global variables that can be imported into the spec/test files so it's clear where they are coming from then run a before/after hook to clear these variables so that the reset-per-test functionality remains.
Name the variables in a way where it's obvious that they are aliased and then spread the word/document this method within my team so that anytime they see this.aliasedString2 then they know it's coming from some method that performs these alias assignments.
I'm sure there may be a better way to handle this so just thought I'd post this question.

How to make a class that inherits the same methods as IO::Path?

I want to build a class in Raku. Here's what I have so far:
unit class Vimwiki::File;
has Str:D $.path is required where *.IO.e;
method size {
return $.file.IO.s;
}
I'd like to get rid of the size method by simply making my class inherit the methods from IO::Path but I'm at a bit of a loss for how to accomplish this. Trying is IO::Path throws errors when I try to create a new object:
$vwf = Vimwiki::File.new(path => 't/test_file.md');
Must specify a non-empty string as a path
in block <unit> at t/01-basic.rakutest line 24
Must specify a non-empty string as a path
I always try a person's code when looking at someone's SO. Yours didn't work. (No declaration of $vwf.) That instantly alerts me that someone hasn't applied Minimal Reproducible Example principles.
So I did and less than 60 seconds later:
IO::Path.new
Yields the same error.
Why?
The doc for IO::Path.new shows its signature:
multi method new(Str:D $path, ...
So, IO::Path's new method expects a positional argument that's a Str. You (and my MRE) haven't passed a positional argument that's a Str. Thus the error message.
Of course, you've declared your own attribute $path, and have passed a named argument to set it, and that's unfortunately confused you because of the coincidence with the name path, but that's the fun of programming.
What next, take #1
Having a path attribute that duplicates IO::Path's strikes me as likely to lead to unnecessary complexity and/or bugs. So I think I'd nix that.
If all you're trying to do is wrap an additional check around the filename, then you could just write:
unit class Vimwiki::File is IO::Path;
method new ($path, |) { $path.IO.e ?? (callsame) !! die 'nope' }
callsame redispatches the ongoing routine call (the new method call), with the exact same arguments, to the next best fitting candidate(s) that would have been chosen if your new one containing the callsame hadn't been called. In this case, the next candidate(s) will be the existing new method(s) of IO::Path.
That seems fine to get started. Then you can add other attributes and methods as you see fit...
What next, take #2
...except for the IO::Path bug you filed, which means you can't initialize attributes in the normal way because IO::Path breaks the standard object construction protocol! :(
Liz shows one way to workaround this bug.
In an earlier version of this answer, I had not only showed but recommended another approach, namely delegation via handles instead of ordinary inheritance. I have since concluded that that was over-complicating things, and so removed it from this answer. And then I read your issue!
So I guess the delegation approach might still be appropriate as a workaround for a bug. So if later readers want to see it in action, follow #sdondley's link to their code. But I'm leaving it out of this (hopefully final! famous last words...) version of this answer in the hope that by the time you (later reader) read this, you just need to do something really simple like take #1.

Runtime method to get names of argument variables?

Inside an Objective-C method, it is possible to get the selector of the method with the keyword _cmd. Does such a thing exist for the names of arguments?
For example, if I have a method declared as such:
- (void)methodWithAnArgument:(id)foo {
...
}
Is there some sort of construct that would allow me to get access to some sort of string-like representation of the variable name? That is, not the value of foo, but something that actually reflects the variable name "foo" in a local variable inside the method.
This information doesn't appear to be stored in NSInvocation or any of its related classes (NSMethodSignature, etc), so I'm not optimistic this can be done using Apple's frameworks or the runtime. I suspect it might be possible with some sort of compile-time macro, but I'm unfamiliar with C macros so I wouldn't know where to begin.
Edit to contain more information about what I'm actually trying to do.
I'm building a tool to help make working with third-party URL schemes easier. There are two sides to how I want my API to look:
As a consumer of a URL scheme, I can call a method like [twitterHandler showUserWithScreenName:#"someTwitterHandle"];
As a creator of an app with a URL scheme, I can define my URLs in a plist dictionary, whose key-value pairs look something like #"showUserWithScreenName": #"twitter://user?screenName={screenName}".
What I'm working on now is finding the best way to glue these together. The current fully-functioning implementation of showUserWithScreenName: looks something like this:
- (void)showUserWithScreenName:(NSString *)screenName {
[self performCommand:NSStringFromSelector(_cmd) withArguments:#{#"screenName": screenName}];
}
Where performCommand:withArguments: is a method that (besides some other logic) looks up the command key in the plist (in this case "showUserWithScreenName:") and evaluates the value as a template using the passed dictionary as the values to bind.
The problem I'm trying to solve: there are dozens of methods like this that look exactly the same, but just swap out the dictionary definition to contain the correct template params. In every case, the desired dictionary key is the name of the parameter. I'm trying to find a way to minimize my boilerplate.
In practice, I assume I'm going to accept that there will be some boilerplate needed, but I can probably make it ever-so-slightly cleaner thanks to NSDictionaryOfVariableBindings (thanks #CodaFi — I wasn't familiar with that macro!). For the sake of argument, I'm curious if it would be possible to completely metaprogram this using something like forwardInvocation:, which as far as I can tell would require some way to access parameter names.
You can use componentsSeparatedByString: with a : after you get the string from NSStringFromSelector(_cmd) and use your #selector's argument names to put the arguments in the correct order.
You can also take a look at this post, which is describing the method naming conventions in Objective C

What are the steps I need to do to complete this programming assignment?

I'm having a hard time understanding what I'm supposed to do. The only thing I've figured out is I need to use yacc on the cminus.y file. I'm totally confused about everything after that. Can someone explain this to me differently so that I can understand what I need to do?
INTRODUCTION:
We will use lex/flex and yacc/Bison to generate an LALR parser. I will give you a file called cminus.y. This is a yacc format grammar file for a simple C-like language called C-minus, from the book Compiler Construction by Kenneth C. Louden. I think the grammar should be fairly obvious.
The Yahoo group has links to several descriptions of how to use yacc. Now that you know flex it should be fairly easy to learn yacc. The only base type is int. An int is 4 bytes. Booleans are handled as ints, as in C. (Actually the grammar allows you to declare a variable as a type void, but let's not do that.) You can have one-dimensional arrays.
There are no pointers, but references to array elements should be treated as pointers (as in C).
The language provides for assignment, IF-ELSE, WHILE, and function calls and returns.
We want our compiler to output MIPS assembly code, and then we will be able to run it on SPIM. For a simple compiler like this with no optimization, an IR should not be necessary. We can output assembly code directly in one pass. However, our first step is to generate a symbol table.
SYMBOL TABLE:
I like Dr. Barrett’s approach here, which uses a lot of pointers to handle objects of different types. In essence the elements of the symbol table are identifier, type and pointer to an attribute object. The structure of the attribute object will differ according to the type. We only have a small number of types to deal with. I suggest using a linear search to find symbols in the table, at least to start. You can change it to hashing later if you want better performance. (If you want to keep in C, you can do dynamic allocation of objects using malloc.)
First you need to make a list of all the different types of symbols that there are—there are not many—and what attributes would be necessary for each. Be sure to allow for new attributes to be added, because we
have not covered all the issues yet. Looking at the grammar, the question of parameter lists for functions is a place where some thought needs to be put into the design. I suggest more symbol table entries and pointers.
TESTING:
The grammar is correct, so taking the existing grammar as it is and generating a parser, the parser will accept a correct C-minus program but it won’t produce any output, because there are no code snippets associated with the rules.
We want to add code snippets to build the symbol table and print information as it does so.
When an identifier is declared, you should print the information being entered into the symbol table. If a previous declaration of the same symbol in the same scope is found, an error message should be printed.
When an identifier is referenced, you should look it up in the table to make sure it is there. An error message should be printed if it has not been declared in the current scope.
When closing a scope, warnings should be generated for unreferenced identifiers.
Your test input should be a correctly formed C-minus program, but at this point nothing much will happen on most of the production rules.
SCOPING:
The most basic approach has a global scope and a scope for each function declared.
The language allows declarations within any compound statement, i.e. scope nesting. Implementing this will require some kind of scope numbering or stacking scheme. (Stacking works best for a one-pass
compiler, which is what we are building.)
(disclaimer) I don't have much experience with compiler classes (as in school courses on compilers) but here's what I understand:
1) You need to use the tools mentioned to create a parser which, when given input will tell the user if the input is a correct program as to the grammar defined in cminus.y. I've never used yacc/bison so I don't know how it is done, but this is what seems to be done:
(input) file-of-some-sort which represents output to be parsed
(output) reply-of-some-sort which tells if the (input) is correct with respect to the provided grammar.
2) It also seems that the output needs to check for variable consistency (ie, you can't use a variable you haven't declared same as any programming language), which is done via a symbol table. In short, every time something is declared you add it to the symbol table. When you encounter an identifier, if it is not one of the language identifiers (like if or while or for), you'll look it up in the symbol table to determine if it has been declared. If it is there, go on. If it's not - print some-sort-of-error
Note: point(2) there is a simplified take on a symbol table; in reality there's more to them than I just wrote but that should get you started.
I'd start with yacc examples - see what yacc can do and how it does it. I guess there must be some big example-complete-with-symbol-table out there which you can read to understand further.
Example:
Let's take input A:
int main()
{
int a;
a = 5;
return 0;
}
And input B:
int main()
{
int a;
b = 5;
return 0;
}
and assume we're using C syntax for parsing. Your parser should deem Input A all right, but should yell "b is undeclared" for Input B.

Writing a TemplateLanguage/VewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.