Pimp my VM (about performance and jitting) - interpreter

for one of my programs I needed a scripting language to dynamically change the world (unit ai, world generation etc), So I wrote a Compiler for a rather basic language (simple objects without inheritance, 1d arrays, 32 bit ints/floats, strings) which also uses reference counting for garbage collection. The Compiler outputs stack based bytecode.
My problem now is that my VM isnt efficient enough (it actually runs 15-30 times slower than unoptimised C). Its a really simple VM which implements decoding with a giant SWITCH-CASE block.
the vm code looks like this:
switch(*ip++)
case ADD:
...
break;
case SUB:
...
break;
So my question is if it is possible to recompile my scripts to x86 assembler and execute them them at runtime. (I think thats what JIT compilers do). I googled a lot but I didn´t found any code samples for example to send x86 code to the processor. If anyone has links to tutorials that explain how to build better VM´s I would be very happy.

Related

Ways to make a D program faster

I'm working on a very demanding project (actually an interpreter), exclusively written in D, and I'm wondering what type of optimizations would generally be recommended. The project makes heavy use of GC, classes, asssociative arrays, and pretty much anything.
Regarding compilation, I've already experimented both with DMD and LDC flags and LDC with -flto=full -O3 -Os -boundscheck=off seems to be making a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to your mind that could help speed up the performance, related or not to the D language. (I'm sure I'm missing several things).
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in comments. Personally under Linux I have a script which dumps a process's stack trace and I do that a few times to get an idea of where it's getting hung up on.
Not sure what you mean by GS.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.
Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

Don't Both JIT and non-JIT enabled Interpreters Ultimately Produce Machine Code

Ok, I have read several discussions regarding the differences between JIT and non-JIT enabled interpreters, and why JIT usually boosts performance.
However, my question is:
Ultimately, doesn't a non-JIT enabled interpreter have to turn bytecode (line by line) into machine/native code to be executed, just like a JIT compiler will do? I've seen posts and textbooks that say it does, and posts that say it does not. The latter argument is that the interpreter/JVM executes this bytecode directly with no interaction with machine/native code.
If non-JIT interpreters do turn each line into machine code, it seems that the primary benefits of JIT are...
The intelligence of caching either all (normal JIT) or frequently encountered (hotspot/adaptive optimization) parts of the bytecode so that the machine code compilation step is not needed every time.
Any optimization JIT compilers can perform in translating bytecode into machine code.
Is that accurate? There seems to be little difference (other than possible optimization, or JITting blocks vs line by line maybe) between the translation of bytecode to machine code via non-JIT and JIT enabled interpreters.
Thanks in advance.
A non-JIT interpreter doesn't convert bytecode to machine code. You can imagine the workings of a non-JIT bytecode interpreter something like this (I'll use a Java-like pseudocode):
int[] bytecodes = { ... };
int ip = 0; // instruction pointer
while(true) {
int code = bytecodes[ip];
switch(code) {
case 0;
// do something
ip += 1; break;
case 1:
// do something else
ip += 1; break;
// and so on...
}
}
So for every bytecode executed, the interpreter has to retrieve the code, switch on its value to decide what to do, and increment its "instruction pointer" before going to the next iteration.
With a JIT, all that overhead would be reduced to nothing. It would just take the contents of the appropriate switch branches (the parts that say "// do something"), string them together in memory, and execute a jump to the beginning of the first one. No software "instruction pointer" would be required -- only the CPU's hardware instruction pointer. No retrieving of bytecodes from memory and switching on their values either.
Writing a virtual machine is not difficult (if it doesn't have to be extremely high performance), and can be an interesting exercise. I did one once for an embedded project where the program code had to be very compact.
Decades ago, there seemed to be a widespread belief that compilers would turn an entire program into machine code, while interpreters would translate a statement into machine code, execute it, discard it, translate the next one, etc. That notion was 99% incorrect, but there were two a tiny kernels of truth to it. On some microprocessors, some instructions required the use of addresses that were specified in code. For example, on the 8080, there was an instruction to read or write a specified I/O address 0x00-0xFF, but there was no instruction to read-or write an I/O address specified in a register. It was common for language interpreters, if user code did something like "out 123,45", to store into three bytes of memory the instructions "out 7Bh/ret", load the accumulator with 2Dh, and make a call to the first of those instructions. In that situation, the interpreter would indeed be producing a machine code instruction to execute the interpreted instruction. Such code generation, however, was mostly limited to things like IN and OUT instructions.
Many common Microsoft BASIC interpreters for the 6502 (and perhaps the 8080 as well) made somewhat more extensive use of code stored in RAM, but the code that was stored in RAM not not significantly depend upon the program that was executing; the majority of the RAM routine would not change during program execution, but the address of the next instruction was kept in-line as part of the routine allowing the use of an absolute-mode "LDA" instruction, which saved at least one cycle off every byte fetch.

Simulating multiple instances of an embedded processor

I'm working on a project which will entail multiple devices, each with an embedded (ARM) processor, communicating. One development approach which I have found useful in the past with projects that only entailed a single embedded processor was develop the code using Visual Studio, divided into three portions:
Main application code (in unmanaged C/C++ [see note])
I/O-simulating code (C/C++) that runs under Visual Studio
Embedded I/O code (C), which Visual Studio is instructed not to build, runs on the target system. Previously this code was for the PIC; for most future projects I'm migrating to the ARM.
Feeding the embedded compiler/linker the code from parts 1 and 3 yields a hex file that can run on the target system. Running parts 1 and 2 together yields code which can run on the PC, with the benefit of better debugging tools and more precise control over I/O behavior (e.g. I can make the simulation code introduce certain types of random hiccups more easily than I can induce controlled hiccups on real hardware).
Target code is written in C, but the simulation environment uses C++ so as to simulate I/O registers. For example, I have a PortArray data structure; the header file for the embedded compiler includes a line like unsigned char LATA # 0xF89; and my header file for simulation includes #define LATA _IOBIT(f89,1) which in turn invokes a macro that accesses a suitable property of an I/O object, so a statement like LATA |= 4; will read the simulated latch, "or" the read value with 4, and write the new value. To make this work, the target code has to compile under C++ as well as under C, but this mostly isn't a problem. The biggest annoyance is probably with enum types (which behave as integers in C, but have to be coaxed to do so in C++).
Previously, I've used two approaches to making the simulation interactive:
Compile and link a DLL with target-application and simulation code, and have VB code in the same project which interacts with it.
Compile the target-application code and some simulation code to an EXE with instance of Visual Studio, and use a second instance of Visual Studio for the simulation-UI. Have the two programs communicate via TCP, so nearly all "real" I/O logic is in the simulation program. For example, the aforementioned `LATA |= 4;` would send a "read port 0xF89" command to the TCP port, get the response, process the received value, and send a "write port 0xF89" command with the result.
I've found the latter approach to run a tiny bit slower than the former in some cases, but it seems much more convenient for debugging, since I can suspend execution of the unmanaged simulation code while the simulation UI remains responsive. Indeed, for simulating a single target device at a time, I think the latter approach works extremely well. My question is how I should best go about simulating a plurality of target devices (e.g. 16 of them).
The difficulty I have is figuring out how to make each simulated instance get its own set of global variables. If I were to compile to an EXE and run one instance of the EXE for each simulated target device, that would work, but I don't know any practical way to maintain debugger support while doing that. Another approach would be to arrange the target code so that everything would compile as one module joined together via #include. For simulation purposes, everything could then be wrapped into a single C++ class, with global variables turning into class-instance variables. That would be a bit more object-oriented, but I really don't like the idea of forcing all the application code to live in one compiled and linked module.
What would perhaps be ideal would be if the code could load multiple instances of the DLL, each with its own set of global variables. I have no idea how to do that, however, nor do I know how to make things interact with the debugger. I don't think it's really necessary that all simulated target devices actually execute code simultaneously; it would be perfectly acceptable for simulation instances to use cooperative multitasking. If there were some way of finding out what range of memory holds the global variables, it might be possible to have the 'task-switch' method swap out all of the global variables used by the previously-running instance and swap in the contents applicable to the instance being switched in. Although I'd know how to do that in an embedded context, though, I'd have no idea how to do that on the PC.
Edit
My questions would be:
Is there any nicer way to allow simulation logic to be paused and examined in VS2010 debugger, while keeping a responsive UI for the simulator front-end, than running the simulator front end and the simulator logic in separate instances of VS2010, if the simulation logic must be written in C and the simulation front end in managed code? For example, is there a way to tell the debugger that when a breakpoint is hit, some or all other threads should be allowed to keep running while the thread that had hit the breakpoint sits paused?
If the bulk of the simulation logic must be source-code compatible with an embedded system written in C (so that the same source files can be compiled and run for simulation purposes under VS2010, and then compiled by the embedded-systems compiler for use in real hardware), is there any way to have the VS2010 debugger interact with multiple simulated instances of the embedded device? Assume performance is not likely to be an issue, but the number of instances will be large enough that creating a separate project for each instance would be likely be annoying in the absence of any way to automate the process. I can think of three somewhat-workable approaches, but don't know how to make any of them work really nicely. There's also an approach which would be better if it's possible, but I don't know how to make it work.
Wrap all the simulation code within a single C++ class, such that what would be global variables in the target system become class members. I'm leaning toward this approach, but it would seem to require everything to be compiled as a single module, which would annoyingly affect the design of the target system code. Is there any nice way to have code access class instance members as though they were globals, without requiring all functions using such instances to be members of the same module?
Compile a separate DLL for each simulated instance (so that e.g. if I want to run up to 16 instances, I would include 16 DLL's in the project, all sharing the same source files). This could work, but every change to the project configuration would have to be repeated 16 times. Really ugly.
Compile the simulation logic to an EXE, and run an appropriate number of instances of that EXE. This could work, but I don't know of any convenient way to do things like set a breakpoint common to all instances. Is it possible to have multiple running instances of an EXE attached to a single debugger instance?
Load multiple instances of a DLL in such a way that each instance gets its own global variables, while still being accessible in the debugger. This would be nicest if it were possible, but I don't know any way to do so. Is it possible? How? I've never used AppDomains, but my intuition would suggest that might be useful here.
If I use one VS2010 instance for the front-end, and another for the simulation logic, is there any way to arrange things so that starting code in one will automatically launch the code in the other?
I'm not particularly committed to any single simulation approach; while it might be nice to know if there's some way of slightly improving the above, I'd also like to know of any other alternative approaches that could work even better.
I would think that you'd still have to run 16 copies of your main application code, but that your TCP-based I/O simulator could keep a different set of registers/state for each TCP connection that comes in.
Instead of a bunch of global variables, put them into a single structure that encompasses the I/O state of a single device. Either spawn off a new thread for each socket, or just keep a list of active sockets and dedicate a single instance of the state structure for each socket.
the simulators I have seen that handle multiple instances of the instruction set/processor are designed that way. There is a structure usually that contains a complete set of registers, and a new pointer or an array of these structures are used to multiply them into multiple instances of the processor.

When someone writes a new programming language, what do they write it IN?

I am dabbling in PHP and getting my feet wet browsing SO, and feel compelled to ask a question that I've been wondering about for years:
When you write an entirely new programming language, what do you write it in?
It's to me a perplexing chicken & egg thing to me. What do you do? Say to yourself Today I'm going to invent a new language! and then fire up. Notepad? Are all compilers built on previously existing languages, such that were one to bother one could chart all programming languages ever devised onto one monstrous branching tree that eventually grounded out at... I don't know, something old?
It's not a stupid question. It's an excellent question.
As already answered the short answer is, "Another language."
Well that leads to some interesting questions? What if its the very first language written for
your particular piece of hardware? A very real problem for people who work on embedded devices. As already answered "a language on another computer". In fact some embedded devices will never get a compiler, their programs will always be compiled on a different computer.
But you can push it back even further. What about the first programs ever written?
Well the first compilers for "high level languages" would have been written in whats called "assembly language". Assembly language is a language where each instruction in the language corresponds to a single instruction to the CPU. Its very low level language and extremely verbose and very labor intensive to write in.
But even writing assembly language requires a program called an assembler to convert the assembly language into "machine language". We go back further. The very first assemblers were written in "machine code". A program consisting entirely of binary numbers that are a direct one-to-one correspondence with the raw language of the computer itself.
But it still doesn't end. Even a file with just raw numbers in it still needs translation. You still need to get those raw numbers in a file into the computer.
Well believe it or not the early computers had a row of switches on the front of them. You flipped the switches till they represented a binary number, then you flicked another switch and that loaded that single number into the computers memory. Then you kept going flicking switched until you had loaded a minimal computer program that could read programs from disk files or punch cards. You flicked another switch and it started the program running. When I went to university in the 80's I saw computers that had that capacity but never was given the job of loading in a program with the switches.
And even earlier than that computer programs had to be hard wired with plug boards!
The most common answer is C. Most languages are implemented in C or in a hybrid of C with callbacks and a "lexer" like Flex and parser generator like YACC. These are languages which are used for one purpose - to describe syntax of another language. Sometimes, when it comes to compiled languages, they are first implemented in C. Then the first version of the language is used to create a new version, and so on. (Like Haskell.)
A lot of languages are bootstrapped- that is written in themselves. As to why you would want to do this, it is often a good idea to eat your own dogfood.
The wikipedia article I refer to discusses the chicken and egg issue. I think you will find it quite interesting.
Pretty much any language, though using one suited to working with graphs and other complex data structures will make many things easier. Production compilers are often written in C or C++ for performance reasons, but languages such as OCaml, SML, Prolog, and Lisp are arguably better for prototyping the language.
There are also several "little languages" used in language design. Lex and yacc are used for specifying syntax and grammars, for example, and they compile to C. (There are ports for other languages, such as ocamllex / ocamlyacc, and many other similar tools.)
As a special case, new Lisp dialects are often built on existing Lisp implementations, since they can piggyback on most of the same infrastructure. Writing a Scheme interpreter can be done in Scheme in under a page of code, at which point one can easily add new features.
Fundamentally, compilers are just programs that read in something and translate it to something else - converting LaTeX source to DVI, converting C code to assembly and then to machine language, converting a grammar specification to C code for a parser, etc. Its designer specifies the structure of the source format (parsing), what those structures mean, how to simplify the data (optimizing), and the kind of output to generate. Interpreters read the source and execute it directly. (Interpreters are typically simpler to write, but much slower.)
"Writing a new programming language" technically doesn't involve any code. It's just coming up with a specification for what your language looks like and how it works. Once you have an idea of what your language is like, you can write translators and interpreters to actually make your language "work".
A translator inputs a program in one language and outputs an equivalent program in another language. An interpreter inputs a program in some language and runs it.
For example, a C compiler typically translates C source code (the input language) to an assembly language program (the output language). The assembler then takes the assembly language program and produces machine language. Once you have your output, you don't need the translators to run your program. Since you now have a machine language program, the CPU acts as the interpreter.
Many languages are implemented differently. For example, javac is a translator that converts Java source code to JVM bytecode. The JVM is an interpreter [1] that runs Java bytecode. After you run javac and get bytecode, you don't need javac anymore. However, whenever you want to run your program, you'll need the JVM.
The fact that translators don't need to be kept around to run a program is what makes it possible to "bootstrap" your language without having it end up running "on top of" layers and layers of other languages.
[1] Most JVMs do translation behind the scenes, but they're not really translators in that the interface to the JVM is not "input language -> output language".
Actually you can write in almost any language you like to. There's nothing that prevents you from writing a C compiler in Ruby. "All" you have to do is parse the program and emit the corresponding machine code. If you can read/write files, your programming language will probably suffice.
If you're starting from scratch on a new platform, you can do cross-compiling: write a compiler for your new platform, that runs in Java or natively on x86. Develop on your PC and then transfer the program to your new target platform.
The most basic compilers are probably Assembler and C.
Generally you can use just about whatever language you like. PHP was written in C, for example. If you have no access to any compiler whatsoever, you're going to have to resort to writing assembly language and compiling it to machine code by hand.
Many languages were first written in another available language and then reimplemented in itself and bootstrapped that way (or just kept the implementation in the foreign language, like PHP and perl), but some languages, like the first assembler was hand compiled to machine code like the first C-compiler was hand compiled to assembly.
I've been interested in bootstrapping ever since I read about it. To learn more I tried doing it myself by writing my own superset of BF, which i called EBF, in itself. the first version of EBF had 3 extra primitives and I hand compiled the first binary. I found a two step rhythm when doing so. I implemented a feature in the current language in one release and had a sweet release where I rewrote the code to utilize the implemented feature. The language was expressive enough to be used to make a LISP interpreter.
I have the hand compiled version together with the source in the first release tag and the code is quite small. The last version is 12 times bigger in size and the code and allows for more compact code so hand compiling the current version would be hard to get right.
Edmund Grimley Evans did something similar with his HEX language
One of the interesting things about doing this yourself is that you understand why some things are as they are. My code was product if small incremental adjustments an it looks more like it has evolved rather than been designed from scratch. I keep that in mind when reading code today which I think looks a little off.
Usually with a general-purpose programming language suitable for systems development, e.g. C, Haskell, ML, Lisp, etc., but the list of options is long. Also, usually with some domain-specific languages for language implementation, i.e. parser and lexical analyzer generators, intermediate languages like LLVM, etc. And probably some shell scripts, testing frameworks, and a build configuration system, e.g. autoconf.
Most compiler were wriiten C or a c like program if not c then assembly lang is the way to go However when writing a new lang from scratch and you do not have a macro lib or source code from a prototype language you have to define your own functions Now in What Language? You can just write a Form "of source code called psedocode to the machine it looks like a bnf grammar from the object oriented structured lang spec like Fortran basic algo lisp. So image writing a cross code resembling any of these language syntax That's psedo code
What are programming languages in general?
programming languages are a just a way to talk to computers. roughly speaking at first because computers could only understand zeros and ones (due to the fact that computers are made of transistors as switches which could only take two states, we call these two states 0 and 1) and working with 0,1 was hard for us as humans so computer scientists decided to do a one-to-one mapping from every instruction in binary(0,1) to a more human readable form which they called it assembly language.
for example if we had an instruction like:
11001101
in assembly it would be called:
LOAD_A 15
which means that load the content of register a into memory location 15. as i said it was just a convention like choosing 0 and 1 for two states of the transistors or anything else in the computer.in this way having a program with 50 instructions , remembering the assembly language would be easier . so the user would write the assembly code and some program (assembler in this case) would translate the codes to binary instructions or machine language as they call it.
but then with the computers getting improved every day there was room for more complicated programs with more instructions, say 10000.
in this case a one-to-one mapping like assembly wouldn't work, so other high level programming languages were created. they said for example if for a relation with I/O devices for printing something on the screen created by the user takes about 80 instructions , let us do something in here and we could package all this code into one library and call it for example printf and also create another program which could translate this printf in here to the related assembly code and from there the assembly would do the rest. so they call it compiler.
so now every user who wants to just print something on the screen he wouldn't have to write all the instructions in binary or assembly he just types printf("something") and all the programs like the compiler and assembler would do the rest. now later other longer codes would be packaged in the same way to just facilitate the work of other people as you see that you could just simplify a thousands line of code into one code in python and pack it for the use of other people.
so let's say that you have packed a lot of different codes in python and created a module(libray, package or anything that you want to call it) and you call that module mgh(just my name). now let's say we have created this mgh somehow that any one who says:
import mgh
mgh.connect(ip,port.data)...
could easily connect to a remote server with the ip and port number specified and send the data afterwards(or something like that). now people could do all of it using one single line, but what that happens is that a lot of codes are getting executed which have been retrieved from the mgh file. and packaging it has not been for speeding up the process of execution but rather facilitating other programmers works. so in here if someone wants to use your code first he should import the file and then the python interpreter would recognize all the code in it and so it could interpret the code.
now if you want to create a programming language and you want to execute it , first it needs a translation, for example let's say that you create a program which could understand the syntax and convert it to c , in this case after it has been translated to c , the rest would be taken care of , by the c compiler , then assembler , linker, ... .
even though you would have to pay the price of being slower since it has to be converted to c first.
now one other thing that you could do is to create a program which could translate all the code to the equivalent assembly language just like what happens with c but in this case the program could do it directly and from there the rest would be done by the linker. we know that this program is called compiler.
so what i am talking about is that, the only code that the system understands is 0,1 , so somehow you should convert you syntax to that, now in our operating systems a lot of different programs like assembler, linker and ... have been created to tell you that if you could convert your code to assembly they could take care of the rest or as i said you could even use other programming languages compilers by converting your code to that language.
Even further binary ,or assembly operations must be translated into functions, thats the assemblers/compilers job, then into object,from data and functions, if you don't have a source file to see" how these objects functionality should be represented in your language implementation ,Then you have to recognize "see" implement, or define your own functions ,procedures, and data structures, Which requires a lot of knowledge, you need to ask yourself what is a function.Your mind then becomes the language simulation.This Separate a Master programmer from the rest.
I too had this question few months back. And I read few articles and watched some videos which helped me to start writing my own language called soft. Its not complete yet but I learned a lot of stuff from this journey.
Basic things you should know is how compiler works when it has to execute a code snippet. Compiler has a lot of phases like lexical analysis, semantic analyzer, AST(Abstract Syntax Tree) etc.
What I did in my new language can be found here - http://www.singhajit.com/writing-a-new-programming-language/
If you are writing a language for first time then all the best and you have a long way to go.

How would one go about testing an interpreter or a compiler?

I've been experimenting with creating an interpreter for Brainfuck, and while quite simple to make and get up and running, part of me wants to be able to run tests against it. I can't seem to fathom how many tests one might have to write to test all the possible instruction combinations to ensure that the implementation is proper.
Obviously, with Brainfuck, the instruction set is small, but I can't help but think that as more instructions are added, your test code would grow exponentially. More so than your typical tests at any rate.
Now, I'm about as newbie as you can get in terms of writing compilers and interpreters, so my assumptions could very well be way off base.
Basically, where do you even begin with testing on something like this?
Testing a compiler is a little different from testing some other kinds of apps, because it's OK for the compiler to produce different assembly-code versions of a program as long as they all do the right thing. However, if you're just testing an interpreter, it's pretty much the same as any other text-based application. Here is a Unix-centric view:
You will want to build up a regression test suite. Each test should have
Source code you will interpret, say test001.bf
Standard input to the program you will interpret, say test001.0
What you expect the interpreter to produce on standard output, say test001.1
What you expect the interpreter to produce on standard error, say test001.2 (you care about standard error because you want to test your interpreter's error messages)
You will need a "run test" script that does something like the following
function fail {
echo "Unexpected differences on $1:"
diff $2 $3
exit 1
}
for testname
do
tmp1=$(tempfile)
tmp2=$(tempfile)
brainfuck $testname.bf < $testname.0 > $tmp1 2> $tmp2
[ cmp -s $testname.1 $tmp1 ] || fail "stdout" $testname.1 $tmp1
[ cmp -s $testname.2 $tmp2 ] || fail "stderr" $testname.2 $tmp2
done
You will find it helpful to have a "create test" script that does something like
brainfuck $testname.bf < $testname.0 > $testname.1 2> $testname.2
You run this only when you're totally confident that the interpreter works for that case.
You keep your test suite under source control.
It's convenient to embellish your test script so you can leave out files that are expected to be empty.
Any time anything changes, you re-run all the tests. You probably also re-run them all nightly via a cron job.
Finally, you want to add enough tests to get good test coverage of your compiler's source code. The quality of coverage tools varies widely, but GNU Gcov is an adequate coverage tool.
Good luck with your interpreter! If you want to see a lovingly crafted but not very well documented testing infrastructure, go look at the test2 directory for the Quick C-- compiler.
I don't think there's anything 'special' about testing a compiler; in a sense it's almost easier than testing some programs, since a compiler has such a basic high-level summary - you hand in source, it gives you back (possibly) compiled code and (possibly) a set of diagnostic messages.
Like any complex software entity, there will be many code paths, but since it's all very data-oriented (text in, text and bytes out) it's straightforward to author tests.
I’ve written an article on compiler testing, the original conclusion of which (slightly toned down for publication) was: It’s morally wrong to reinvent the wheel. Unless you already know all about the preexisting solutions and have a very good reason for ignoring them, you should start by looking at the tools that already exist. The easiest place to start is Gnu C Torture, but bear in mind that it’s based on Deja Gnu, which has, shall we say, issues. (It took me six attempts even to get the maintainer to allow a critical bug report about the Hello World example onto the mailing list.)
I’ll immodestly suggest that you look at the following as a starting place for tools to investigate:
Software: Practice and Experience April 2007. (Payware, not available to the general public---free preprint at http://pobox.com/~flash/Practical_Testing_of_C99.pdf.
http://en.wikipedia.org/wiki/Compiler_correctness#Testing (Largely written by me.)
Compiler testing bibliography (Please let me know of any updates I’ve missed.)
In the case of brainfuck, I think testing it should be done with brainfuck scripts. I would test the following, though:
1: Are all the cells initialized to 0
2: What happens when you decrement the data pointer when it's currently pointing to the first cell? Does it wrap? Does it point to invalid memory?
3: What happens when you increment the data pointer when it's pointing at the last cell? Does it wrap? Does it point to invalid memory
4: Does output function correctly
5: Does input function correctly
6: Does the [ ] stuff work correctly
7: What happens when you increment a byte more than 255 times, does it wrap to 0 properly, or is it incorrectly treated as an integer or other value.
More tests are possible too, but this is probably where i'd start. I wrote a BF compiler a few years ago, and that had a few extra tests. Particularly I tested the [ ] stuff heavily, by having a lot of code inside the block, since an early version of my code generator had issues there (on x86 using a jxx I had issues when the block produced more than 128 bytes or so of code, resulting in invalid x86 asm).
You can test with some already written apps.
The secret is to:
Separate the concerns
Observe the law of Demeter
Inject your dependencies
Well, software that is hard to test is a sign that the developer wrote it like it's 1985. Sorry to say that, but utilizing the three principles I presented here, even line numbered BASIC would be unit testable (it IS possible to inject dependencies into BASIC, because you can do "goto variable".