Why don't compilers generate microinstructions rather than assembly code?

I would like to know why, in the real world, compilers produce Assembly code, rather than microinstructions.
If you're already bound to one architecture, why not go one step further and free the processor from having to turn assembly-code into microinstructions at Runtime?
I think perhaps there's an implementation bottleneck somewhere, but I haven't found anything on Google.
EDIT: by microinstructions I mean: if your assembly instruction is ADD(R1,R2), the microinstructions would be: load R1 into the ALU, load R2 into the ALU, execute the operation, load the result back into R1. Another way to see this is to equate one microinstruction to one clock cycle.
I was under the impression that microinstruction was the 'official' name. Apparently there's some mileage variation here.

Compilers don't produce micro-instructions because processors don't execute micro-instructions. They are an implementation detail of the chip, not something exposed outside the chip. There's no way to provide micro-instructions to a chip.

Because an x86 CPU doesn't execute micro-operations, it executes opcodes. You cannot create a binary image that contains micro-operations, since there is no way to encode them in a form that the CPU understands.
What you are suggesting is basically a new RISC-style instruction set for x86 CPUs. The reason that isn't happening is that it would break compatibility with the vast number of applications and operating systems written for the x86 instruction set.

The answer is quite easy.
(Some) compilers do indeed generate code sequences like load r1, load r2, add r2 to r1. But these are precisely the machine code instructions (what you call microcode). These instructions are the one and only interface between the outside world and the innards of a processor.
(Other compilers just generate C and let a C backend like gcc take care of the dirty details.)


Ways to make a D program faster

I'm working on a very demanding project (actually an interpreter), written exclusively in D, and I'm wondering what type of optimizations would generally be recommended. The project makes heavy use of GC, classes, associative arrays, and pretty much everything else.
Regarding compilation, I've already experimented both with DMD and LDC flags and LDC with -flto=full -O3 -Os -boundscheck=off seems to be making a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to your mind that could help speed up the performance, related or not to the D language. (I'm sure I'm missing several things).
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in the comments. Personally, under Linux, I have a script which dumps a process's stack trace, and I run it a few times to get an idea of where the program is spending its time.
Not sure what you mean by GS.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.
Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.experimental.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

How can I relocate CP/M BDOS to a custom memory address?

Maybe it's a newbie CP/M question, but anyway... Is it possible to relocate the CP/M BDOS? I have hardware that I've written a BIOS for, so that it can be used with CP/M 2.2. However, that BDOS (as seen by disassembling it) uses fixed addresses. Since I don't know CP/M too well, I have no idea how to place the CP/M BDOS at another start address. The only (somewhat ugly!) solution I could figure out: I found a CP/M disassembly listing, so I simply modified the "ORG" directive and re-assembled it. Is there any other way, e.g. some CP/M utility? And if so, how can it do that, since the BDOS uses JP, CALL etc. opcodes (sorry, I am only familiar with Z80, not so much with original 8080 assembly), so it's not simply position-independent? Thanks!
No need for a disassembly; the original CP/M source code is available (and, yes, BDOS and everything else resident is assembly, not PL/M). Within the "CP/M 2.2 ORIGINAL SOURCE" offered there you should find both OS3BDOS.ASM and OS3BDOS1.ASM. These are both different released versions of the CP/M 2.2 BDOS source (see README.TXT); you should be able to adjust the org and rebuild either of them, using the assembler also provided in the archive.
Alternatively you can use the MOVCPM tool (also included in the archive). It's intended to relocate the BDOS and the supplied BIOS but there's nothing to stop you replacing the BIOS after the fact.
Possibly of interest to you if you'd prefer to write a cross-relocator: from a quick bit of research, the interesting bit is this from the BDOS source:
if test
org 0dc00h
else
org 0800h
endif
Why would the BDOS ever be at 0800 on any useful machine? Why is dc00 a 'test' address? Because the relocation is handled very trivially: BDOS is built once at 0800 and once at dc00. Through a binary compare of those two builds any differences must be where correct addresses need to be inserted, and the difference from the original org value tells you how to calculate the value to insert.
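To make the compare-two-builds idea concrete, here is a minimal sketch in Python (this is not the actual MOVCPM code). It assumes both origins are page-aligned, so only the high byte of each 16-bit address differs between the builds. The function names and the toy three-byte image are made up for illustration.

```python
def find_relocations(build_a: bytes, build_b: bytes) -> list:
    """Return the offsets whose bytes differ between two builds of the same code."""
    if len(build_a) != len(build_b):
        raise ValueError("builds must be the same length")
    return [i for i, (a, b) in enumerate(zip(build_a, build_b)) if a != b]


def relocate(build: bytes, relocs: list, old_org: int, new_org: int) -> bytes:
    """Patch every marked high byte by the page difference between the origins."""
    if (old_org | new_org) & 0xFF:
        raise ValueError("this sketch assumes page-aligned origins")
    page_delta = (new_org - old_org) >> 8
    image = bytearray(build)
    for offset in relocs:
        image[offset] = (image[offset] + page_delta) & 0xFF
    return bytes(image)


# Toy image: a single jump to org+10h (opcode C3h, little-endian address).
at_0800 = bytes([0xC3, 0x10, 0x08])
at_dc00 = bytes([0xC3, 0x10, 0xDC])

relocs = find_relocations(at_0800, at_dc00)
at_e400 = relocate(at_0800, relocs, 0x0800, 0xE400)
```

A cross-relocator along these lines would store the list of offsets (or a bitmap of them) next to one of the builds and apply relocate() for whatever target address the memory size requires.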

Pimp my VM (about performance and jitting)

For one of my programs I needed a scripting language to dynamically change the world (unit AI, world generation etc.), so I wrote a compiler for a rather basic language (simple objects without inheritance, 1D arrays, 32-bit ints/floats, strings) which also uses reference counting for garbage collection. The compiler outputs stack-based bytecode.
My problem now is that my VM isn't efficient enough (it actually runs 15-30 times slower than unoptimised C). It's a really simple VM which implements decoding with a giant switch-case block.
The VM code looks like this:
case ADD:
case SUB:
So my question is whether it is possible to recompile my scripts to x86 assembler and execute them at runtime (I think that's what JIT compilers do). I googled a lot but didn't find any code samples showing, for example, how to send x86 code to the processor. If anyone has links to tutorials that explain how to build better VMs, I would be very happy.

Don't Both JIT and non-JIT enabled Interpreters Ultimately Produce Machine Code?

Ok, I have read several discussions regarding the differences between JIT and non-JIT enabled interpreters, and why JIT usually boosts performance.
However, my question is:
Ultimately, doesn't a non-JIT enabled interpreter have to turn bytecode (line by line) into machine/native code to be executed, just like a JIT compiler will do? I've seen posts and textbooks that say it does, and posts that say it does not. The latter argument is that the interpreter/JVM executes this bytecode directly with no interaction with machine/native code.
If non-JIT interpreters do turn each line into machine code, it seems that the primary benefits of JIT are...
The intelligence of caching either all (normal JIT) or frequently encountered (hotspot/adaptive optimization) parts of the bytecode so that the machine code compilation step is not needed every time.
Any optimization JIT compilers can perform in translating bytecode into machine code.
Is that accurate? There seems to be little difference (other than possible optimization, or JITting blocks vs line by line maybe) between the translation of bytecode to machine code via non-JIT and JIT enabled interpreters.
Thanks in advance.
A non-JIT interpreter doesn't convert bytecode to machine code. You can imagine the workings of a non-JIT bytecode interpreter something like this (I'll use a Java-like pseudocode):
int[] bytecodes = { ... };
int ip = 0; // instruction pointer
while (true) {
    int code = bytecodes[ip];
    switch (code) {
        case 0:
            // do something
            ip += 1; break;
        case 1:
            // do something else
            ip += 1; break;
        // and so on...
    }
}
So for every bytecode executed, the interpreter has to retrieve the code, switch on its value to decide what to do, and increment its "instruction pointer" before going to the next iteration.
With a JIT, all that overhead would be reduced to nothing. It would just take the contents of the appropriate switch branches (the parts that say "// do something"), string them together in memory, and execute a jump to the beginning of the first one. No software "instruction pointer" would be required -- only the CPU's hardware instruction pointer. No retrieving of bytecodes from memory and switching on their values either.
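As a rough, runnable illustration of that difference, here is a sketch in Python. The three opcodes and the stack machine are made up, and translate() generates Python source rather than machine code, so it is only an analogy for a JIT: the point is that decoding happens once, up front, and the generated run() contains just the branch bodies strung together.

```python
PUSH, ADD, SUB = 0, 1, 2  # hypothetical opcodes; PUSH takes one operand


def interpret(bytecodes):
    """Switch-style loop: fetch, decode, execute, advance ip, repeat."""
    stack, ip = [], 0
    while ip < len(bytecodes):
        code = bytecodes[ip]
        if code == PUSH:
            stack.append(bytecodes[ip + 1])
            ip += 2
        elif code == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            ip += 1
        elif code == SUB:
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
            ip += 1
        else:
            raise ValueError(f"unknown opcode {code}")
    return stack


# The "contents of the switch branches", kept as source text.
BODIES = {
    ADD: "b, a = stack.pop(), stack.pop(); stack.append(a + b)",
    SUB: "b, a = stack.pop(), stack.pop(); stack.append(a - b)",
}


def translate(bytecodes):
    """Decode once and string the branch bodies together into one function."""
    lines, ip = ["def run():", "    stack = []"], 0
    while ip < len(bytecodes):
        code = bytecodes[ip]
        if code == PUSH:
            lines.append(f"    stack.append({bytecodes[ip + 1]!r})")
            ip += 2
        elif code in BODIES:
            lines.append("    " + BODIES[code])
            ip += 1
        else:
            raise ValueError(f"unknown opcode {code}")
    lines.append("    return stack")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source
    return namespace["run"]


program = [PUSH, 7, PUSH, 5, SUB, PUSH, 10, ADD]
run = translate(program)
```

Calling interpret(program) decodes every opcode on every execution, while run() can be called repeatedly without touching the bytecode again.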
Writing a virtual machine is not difficult (if it doesn't have to be extremely high performance), and can be an interesting exercise. I did one once for an embedded project where the program code had to be very compact.
Decades ago, there seemed to be a widespread belief that compilers would turn an entire program into machine code, while interpreters would translate a statement into machine code, execute it, discard it, translate the next one, etc. That notion was 99% incorrect, but there were a few tiny kernels of truth to it. On some microprocessors, some instructions required the use of addresses that were specified in code. For example, on the 8080, there was an instruction to read or write a specified I/O address 0x00-0xFF, but there was no instruction to read or write an I/O address specified in a register. It was common for language interpreters, if user code did something like "out 123,45", to store into three bytes of memory the instructions "out 7Bh/ret", load the accumulator with 2Dh, and make a call to the first of those instructions. In that situation, the interpreter would indeed be producing a machine code instruction to execute the interpreted instruction. Such code generation, however, was mostly limited to things like IN and OUT instructions.
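The three bytes in that example can be written out explicitly. This is only a sketch: D3h is the 8080 OUT opcode and C9h is RET, but the function name is made up, and Python merely builds the bytes here; it cannot execute them.

```python
OUT_IMM = 0xD3  # 8080 "OUT port" opcode, followed by a one-byte port number
RET = 0xC9      # 8080 "RET" opcode


def out_thunk(port: int) -> bytes:
    """Build the three-byte routine 'OUT port / RET' for a runtime-chosen port."""
    if not 0 <= port <= 0xFF:
        raise ValueError("8080 I/O ports are 0x00-0xFF")
    return bytes([OUT_IMM, port, RET])


# "out 123,45": the thunk targets port 7Bh; the value 2Dh goes in the accumulator.
thunk = out_thunk(123)
accumulator = 45
```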
Many common Microsoft BASIC interpreters for the 6502 (and perhaps the 8080 as well) made somewhat more extensive use of code stored in RAM, but the code that was stored in RAM did not significantly depend upon the program that was executing. The majority of the RAM routine would not change during program execution, but the address of the next instruction was kept in-line as part of the routine, allowing the use of an absolute-mode "LDA" instruction, which saved at least one cycle on every byte fetch.

Using open source SNES emulator code to turn a rom file into a self-contained executable game

Would it be possible to take the source code from a SNES emulator (or any other game system emulator for that matter) and a game ROM for the system, and somehow create a single self-contained executable that lets you play that particular ROM without needing either the individual rom or the emulator itself to play? Would it be difficult, assuming you've already got the rom and the emulator source code to work with?
It shouldn't be too difficult if you have the emulator source code. You can use a method that is often used to store images in C source files.
Basically, what you need to do is create a char array (for example, a const unsigned char array) in a header file, and store the contents of the ROM file in that variable. You may want to write a script to automate this for you.
Then, you will need to alter the source code so that instead of reading the rom in from a file, it uses the in memory version of the rom, stored in your variable and included from your header file.
It may require a little bit of work if you need to emulate file pointers and such, or you may be lucky and find that the rom loading function just loads the whole file in at once. In this case it would probably be as simple as replacing the file load function with a function to return your pointer.
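A script for the conversion step could look roughly like the following sketch. The names rom_to_c_header, rom_data and rom_data_len are made up for illustration, and the sketch assumes a non-empty ROM that is small enough to read into memory in one go.

```python
def rom_to_c_header(rom: bytes, name: str = "rom_data") -> str:
    """Render ROM bytes as a C array definition plus a length constant."""
    if not rom:
        raise ValueError("ROM is empty")
    lines = [f"static const unsigned char {name}[] = {{"]
    for i in range(0, len(rom), 12):  # 12 bytes per line keeps the header readable
        chunk = rom[i:i + 12]
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"static const unsigned long {name}_len = {len(rom)}UL;")
    return "\n".join(lines) + "\n"


# Three bytes stand in for a real ROM image here.
header = rom_to_c_header(bytes([0x78, 0x18, 0xFB]))
```

In practice you would read the ROM with open(path, "rb").read() and write the returned text to a .h file that the emulator's loader includes.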
However, be careful about licensing issues. If the emulator is licensed under the GPL, you may not be legally allowed to store a proprietary file in the executable, so it would be worth checking that, especially before you release or distribute it (if you plan to do so).
Yes, more than possible; it has been done many times. Google: static binary translation. Graham Toal has a good how-to paper on the subject, which should show up early in the hits. I may have left some code out there as well.
Completely removing the ROM may be a bit more work than you think, but not using an emulator is definitely possible. Actually, both requirements are possible, and you may be surprised how many handheld console games or set-top box games are translated and not emulated, especially on platforms like those from Nintendo where there isn't enough processing power to emulate in real time.
You need a good emulator as a reference and/or to write your own emulator as a reference. Then you need to write a disassembler, and then have that disassembler generate C code (please don't try to translate directly to another target; I made that mistake once. C is portable, and the compilers will take care of a lot of dead code elimination for you). So an instruction of a make-believe instruction set might be:
add r0,r0,#2
And that may translate into:
//add r0,r0,#2
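As a sketch of what such a disassembler-to-C step might emit, the following Python handles only the make-believe add instruction above. The regular expression, the function name and the idea of one C variable per register are all assumptions made for illustration.

```python
import re

# Matches the make-believe form "add rD,rS,#imm" only.
_ADD_IMM = re.compile(r"add\s+(r\d+)\s*,\s*(r\d+)\s*,\s*#(\d+)$")


def translate_to_c(asm: str) -> str:
    """Emit the original instruction as a comment, then an equivalent C statement."""
    text = asm.strip()
    match = _ADD_IMM.match(text)
    if match is None:
        raise ValueError(f"unsupported instruction: {text}")
    dst, src, imm = match.groups()
    return f"//{text}\n{dst} = {src} + {imm};"


c_code = translate_to_c("add r0,r0,#2")
```

Keeping the source instruction as a comment above each generated statement makes it much easier to check the translation against a reference emulator.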
It looks like the SNES is related to the 6502, which is what Asteroids used, and that is the translation I have been working on, off and on, for a while now as a hobby. The emulator you are using is probably written and tuned for runtime performance, so it may be difficult at best to use as a reference and to check in lock step with the translated code. The 6502 is nice because, compared to say the Z80, there really are not that many instructions. As with any variable-length instruction set, the disassembler is your first big hurdle. Do not think linearly; think in execution order, think like an emulator. You cannot linearly translate instructions from zero to N or from N down to zero. You have to follow all the possible execution paths, marking each byte in the ROM as either the first byte of an instruction or not. Some bytes you can identify as data and, if you choose, mark as such; otherwise assume all other bytes are data or fill. Figuring out what to do with this data is the hard part of getting rid of the ROM. Some code addresses data directly, other code uses register-indirect addressing, meaning that at translation time you have no idea where that data is or how much of it there is. Once you have marked all the starting bytes of instructions, it is a trivial task to walk the ROM from zero to N, disassembling and/or translating.
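The path-following step can be sketched as a worklist algorithm. The five-opcode instruction set below is invented for illustration (it is not the 6502), branch targets are one-byte absolute addresses, and indirect jumps are ignored, which is exactly the case a real translator has to handle separately.

```python
# Hypothetical toy instruction set: opcode -> instruction length in bytes.
NOP, LDI, JMP, BEQ, RET = 0x01, 0x02, 0x03, 0x04, 0x05
LENGTH = {NOP: 1, LDI: 2, JMP: 2, BEQ: 2, RET: 1}


def find_instruction_starts(rom: bytes, entry: int = 0) -> set:
    """Follow every execution path from entry, marking first bytes of instructions."""
    starts = set()
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        while 0 <= pc < len(rom) and pc not in starts:
            op = rom[pc]
            if op not in LENGTH or pc + LENGTH[op] > len(rom):
                break  # unknown or truncated instruction: give up on this path
            starts.add(pc)
            if op == JMP:
                pc = rom[pc + 1]              # unconditional: only the target runs
            elif op == BEQ:
                worklist.append(rom[pc + 1])  # taken path, explored later
                pc += LENGTH[op]              # fall-through path
            elif op == RET:
                break                         # the path ends here
            else:
                pc += LENGTH[op]
    return starts


# LDI 5 / BEQ 7 / JMP 8 / data FFh / NOP / RET / data 01h
rom = bytes([0x02, 0x05, 0x04, 0x07, 0x03, 0x08, 0xFF, 0x01, 0x05, 0x01])
starts = find_instruction_starts(rom)
```

Note that the operand byte at offset 1 and the trailing byte at offset 9 both look like valid opcodes, yet neither is marked, because no execution path ever reaches them as the first byte of an instruction. A linear walk from zero to N would have misread them.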
Good luck, enjoy, it is well worth the experience.