Ways to make a D program faster - optimization

I'm working on a very demanding project (an interpreter, actually), written exclusively in D, and I'm wondering what kinds of optimizations are generally recommended. The project makes heavy use of the GC, classes, associative arrays, and pretty much everything else.
Regarding compilation, I've already experimented with both DMD and LDC flags, and LDC with -flto=full -O3 -Os -boundscheck=off seems to make a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to mind that could help improve performance, whether related to the D language or not. (I'm sure I'm missing several things.)

Compiler flags: I would add -mcpu=native if the program will be running on your machine. I'm not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in the comments. Personally, under Linux I use a script which dumps a process's stack trace, and I do that a few times to get an idea of where it's getting hung up.
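If you don't have such a script handy, gdb can do the same job: something like gdb -p <pid> -batch -ex "thread apply all bt" prints every thread's stack and exits, and a handful of samples taken that way is usually enough to spot the hot paths (the classic "poor man's profiler").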
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirection and cannot be inlined. Mark methods final unless they must be overridable, and see if you can rewrite your program using a form of polymorphism that doesn't involve indirection, such as template metaprogramming.
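The mechanism is easiest to show in C (a hypothetical sketch, not from the original thread): a virtual call compiles down to a call through a function pointer, which the compiler generally cannot inline, whereas a direct call can be inlined away entirely.

#include <stdio.h>

typedef struct Shape Shape;
struct Shape {
    double (*area)(const Shape *);  /* "vtable slot": indirect call target */
    double w, h;
};

static double rect_area(const Shape *s) { return s->w * s->h; }

/* Direct equivalent: the compiler can inline this into the caller. */
static inline double rect_area_direct(const Shape *s) { return s->w * s->h; }

int main(void) {
    Shape r = { rect_area, 3.0, 4.0 };
    printf("%f\n", r.area(&r));           /* indirect: pointer load + call */
    printf("%f\n", rect_area_direct(&r)); /* direct: typically inlined     */
    return 0;
}

In D, marking a method (or the whole class) final, or replacing runtime polymorphism with templates, turns the first kind of call into the second.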
Since you mentioned associative arrays: these make heavy use of the GC. To speed them up, switch to a third-party container library built on top of std.experimental.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing interpreters, all the way up to JIT/AOT compilation. Perhaps you could link against an existing library such as LLVM or libjit.

Related

How to debug moonscript?

I'm trying to write a game based on the Love2d framework, compiled from Moonscript. Every time I make a mistake in my code, my application throws an error, and the error refers to the compiled Lua code, not the Moonscript, so I have no idea where exactly the error happens. What is the solution in this situation? Thanks.
Moonscript does support source-mapping/error-rewriting, but it is only supported when running in the moon interpreter: https://moonscript.org/reference/command_line.html#error_rewriting
I think it could be enabled in another Lua environment, but I am not completely sure what would be involved.
It would definitely require Moonscript to hold on to the source-map tables that are created during compilation, so you couldn't use moonc; instead, use the moonscript module to just-in-time compile require'd modules:
main.lua
-- attempt to require moonscript,
-- for development
pcall(require, 'moonscript')
-- load the main file
require 'init'
init.moon
love.draw = ->
  print "test"
With this code and Moonscript properly installed, you can just run the project using love . as normal. The require 'moonscript' call changes require to compile Moonscript modules on the fly. The performance penalty is negligible, and after all modules have been loaded there is no difference at all.
Debugging is a problem for pretty much any source-to-source compilation system. The target language has no idea that the original language exists, so it can only talk about things in terms of the target language. The more divergent the target and original languages are, the more difficult debugging will be.
This is a big part of the reason why C++ compilers don't compile to C anymore.
The only real way to deal with this is to become intimately familiar with how the Moonscript compiler generates Lua from your Moonscript code. Learn Lua and carefully read the output Lua, comparing it to the Moonscript that produced it. That will make it easier to map a given Lua error and source location back to the actual Moonscript code that created it.

Pimp my VM (about performance and jitting)

For one of my programs I needed a scripting language to dynamically change the world (unit AI, world generation, etc.), so I wrote a compiler for a rather basic language (simple objects without inheritance, 1D arrays, 32-bit ints/floats, strings) which uses reference counting for garbage collection. The compiler outputs stack-based bytecode.
My problem now is that my VM isn't efficient enough (it actually runs 15-30 times slower than unoptimized C). It's a really simple VM which implements decoding with a giant switch-case block.
The VM code looks like this:
for (;;) {
    switch (*ip++) {
    case ADD:
        ...
        break;
    case SUB:
        ...
        break;
    }
}
So my question is whether it is possible to recompile my scripts to x86 assembly and execute them at runtime (I think that's what JIT compilers do). I googled a lot, but I didn't find any code samples showing, for example, how to hand x86 code to the processor. If anyone has links to tutorials that explain how to build better VMs, I would be very happy.
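For reference, the core mechanism a JIT uses is surprisingly small: allocate memory the OS will let you execute, copy machine code into it, and call it through a function pointer. A minimal sketch for x86-64 Linux (SysV calling convention; the hard-coded bytes encode int add(int a, int b)):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* mov eax, edi ; add eax, esi ; ret  --  i.e. return a + b */
    unsigned char code[] = { 0x89, 0xf8, 0x01, 0xf0, 0xc3 };

    /* Ask the OS for a page we are allowed to write to and execute. */
    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    memcpy(mem, code, sizeof code);

    /* Casting a data pointer to a function pointer is non-standard C,
       but it is how JITs on POSIX systems work in practice. */
    int (*add)(int, int) = (int (*)(int, int))mem;
    printf("%d\n", add(2, 3)); /* prints 5 */

    munmap(mem, 4096);
    return 0;
}

A real JIT generates those bytes from your bytecode instead of hard-coding them, and on hardened systems you map the page writable first and flip it to executable with mprotect afterwards. Libraries like LLVM, libjit, or DynASM take care of the instruction encoding for you.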

Is it easy to achieve -O3 level of optimization when using LLVM directly?

Is it easy to achieve a high level of optimization with LLVM?
To give a concrete example, let's assume that I have a simple language that I want to write a compiler for:
simple functions
simple structs
tables
pointers (with arithmetic)
control structures
etc.
I can quite easily create a compile-to-C backend and rely on clang -O3.
Is it as easy to use the LLVM API for that purpose?
Except perhaps for a few high-level optimizations (as in, ones aware of high-level language features or details that aren't encoded in LLVM IR), Clang's backend does little more than generate straightforward IR and run a set of LLVM optimization passes on it. All of these (or at least most) should be available through the opt command, and also as API calls when using the C++ libraries that all LLVM tools are built on. See the tutorial for a simple example. I see several advantages:
LLVM IR is far simpler than C, and there's already a convenient API for generating it programmatically. To generate C, you'd either have lots of ugly and unreliable string fiddling, or have to build an AST for the C language yourself. Or both.
You get to choose the set of optimizations yourself (it's quite possible that Clang's set of passes isn't ideal for the code your language supports and the IR representation your compiler generates). This also means you can, during development, run just the passes that check IR well-formedness (uncovering compiler bugs faster). You can just copy Clang's pass order, but if you feel like it, you can also experiment.
It allows better compile times. Clang is fast for a C compiler, but you'd still be adding unnecessary overhead: you generate C code, then Clang parses it, converts it to IR, and goes on to do pretty much what you could have done right away.
You may have access to a broader range of features, or at least get them more easily (i.e. without having to incorporate #defines, obscure pragmas, intrinsics, or command-line options to provide them). I'm talking about things like vectors, guaranteed tail calls (well, more guaranteed than in C anyway; AFAIK some code generators ignore them), pure/readonly functions, and more control over memory layout and type conversions (for instance, zero-extending vs. sign-extending). Granted, you may not need most of them.
LLVM has built-in optimization passes, so you can achieve O3-like optimizations through the API. For example:
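Here is a minimal sketch using the LLVM-C API and the new pass manager (available in recent LLVM releases; this is an illustration, not the only way to do it). It builds a trivial add function in IR, then runs the same default pipeline that -O3 uses:

#include <llvm-c/Core.h>
#include <llvm-c/Transforms/PassBuilder.h>

int main(void) {
    /* Build: define i32 @add(i32 %a, i32 %b) { ret i32 %a + %b } */
    LLVMModuleRef mod = LLVMModuleCreateWithName("demo");
    LLVMTypeRef i32 = LLVMInt32Type();
    LLVMTypeRef params[] = { i32, i32 };
    LLVMTypeRef fnty = LLVMFunctionType(i32, params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(mod, "add", fnty);

    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
    LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                    LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(b, sum);

    /* Run the standard -O3 pipeline over the whole module. */
    LLVMPassBuilderOptionsRef opts = LLVMCreatePassBuilderOptions();
    LLVMRunPasses(mod, "default<O3>", NULL, opts);

    LLVMDumpModule(mod); /* print the optimized IR to stderr */

    LLVMDisposePassBuilderOptions(opts);
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}

Build it with the flags from llvm-config (--cflags, --ldflags, --libs) to pull in the right headers and libraries. The string passed to LLVMRunPasses is a pass pipeline description, so swapping in a custom pass list is a one-line change.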

Porting newlib to a custom ARM setup

This is my first post, and it covers something I've been trying to get working, on and off, for about a year now.
Essentially it boils down to the following: I have a copy of newlib which I'm trying to get working on an LPC2388 (an ARM7TDMI from NXP), on a Linux box using arm-elf-gcc.
The question I have is this: I've been looking at a lot of tutorials about porting newlib, and they all talk about the stubs (exit, open, read/write, sbrk, and so on), and I have a pretty good idea of how to implement all of these functions. But where should I put them?
I have the newlib distribution from sources.redhat.com/pub/newlib/newlib-1.18.0.tar.gz, and after poking around I found syscalls.c (in newlib-1.18.0/newlib/libc/sys/arm) which contains all of the stubs I have to update, but they're all filled in with rather finished-looking code (which does NOT seem to work without the crt0.S, which itself does not work with my chip).
Should I just be wiping out those functions myself and re-writing them? Or should I write them somewhere else? Should I make a whole new folder in newlib/libc/sys with the name of my "architecture" and change the target to match?
I'm also curious whether there's proper etiquette on distributing something like this after releasing it as an open-source project. I currently have a script which downloads binutils, arm-elf-gcc, newlib, and gdb, and compiles them. If I am modifying files in the newlib directory, should I provide a patch which my script auto-applies? Or should I add the modified newlib to the repository?
Thanks for bothering to read! Following this is a more detailed breakdown of what I'm doing.
For those who want/need more info about my setup:
I'm building an ARM videogame console based loosely on the Uzebox project (http://belogic.com/uzebox/).
I've been doing all sorts of things, pulling from a lot of different resources as I try to figure it out. You can read about the start of my adventures here (SparkFun forums; no one responds as I figure it out on my own): forum.sparkfun.com/viewtopic.php?f=11&t=22072
I followed all of this by reading through the Stack Overflow questions about porting newlib and saw a few of the different tutorials (like wiki.osdev.org/Porting_Newlib), but they also suffer from telling me to implement stubs without mentioning where, who, what, when, or how!
But where should I put them?
You can put them wherever you like, so long as they exist in the final link. You might incorporate them into the libc library itself, or you might keep that generic and have the syscalls as a separate target-specific object file or library.
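As an illustration of the second option, a bare-bones syscalls.c often looks something like the sketch below. This is hypothetical boilerplate, not code from your port: the _end symbol must come from your linker script, and _write would push bytes out whatever peripheral your board actually has.

#include <sys/stat.h>
#include <errno.h>

extern char _end;              /* first free RAM address, from the linker script */
static char *heap_end = &_end;

/* newlib's malloc() calls _sbrk to grow the heap. */
void *_sbrk(int incr) {
    char *prev = heap_end;
    heap_end += incr;
    return prev;
}

int _write(int fd, const char *buf, int len) {
    /* send bytes to a UART here so printf() has somewhere to go */
    return len;
}

int _read(int fd, char *buf, int len)   { return 0; }
int _close(int fd)                      { return -1; }
int _lseek(int fd, int off, int whence) { return 0; }
int _fstat(int fd, struct stat *st)     { st->st_mode = S_IFCHR; return 0; }
int _isatty(int fd)                     { return 1; }
void _exit(int status)                  { for (;;) {} }
int _kill(int pid, int sig)             { errno = EINVAL; return -1; }
int _getpid(void)                       { return 1; }

Compile that as its own object file and link it into the final image alongside your own crt0; newlib itself stays untouched.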
You may need to create your own target-specific crt0.S and assemble and link it for your target.
A good tutorial by Miro Samek of Quantum Leaps on getting GNU/ARM development up and running is available here. The examples are based on an Atmel AT91 part, so you will need to know a little about your NXP device to adapt the start-up code.
A ready-made newlib porting layer for LPC2xxx was available here, but the links to the files appear to be broken. The same porting layer is used in Martin Thomas' WinARM project. This is a Windows port of GNU ARM GCC, but the examples included in it are target-specific, not host-specific.
You should only need to modify the porting layer of newlib, and since it is target- and application-specific, you need not (in fact, probably should not) submit your code to the project.
When I was using newlib, that is exactly what I did: blew away crt0.S, syscalls.c, and libcfunc.c. My personal preference was to link in replacements for crt0.S and syscalls.c (rolling the few functions from libcfunc into the syscalls.c replacement), tailored to the embedded application.
I never had an interest in pushing any of that work back into the distro, so I can't help you there.
You are on the right path, though: crt0.S and syscalls.c are where you want to work to customize for your target. Personally, I was interested in a C library (and printf), so I would primarily neuter all of the functions to return 0 or 1 or whatever it took to get each function to just work and not get in the way of linking, periodically making the file I/O functions operate on data linked into ROM/RAM. Basically, without replacing or modifying any other files in newlib, I had a fair amount of success, so you are on the right path.

Any Macro or Technique for Partial Optimization?

I am working on a lock-free structure with the g++ compiler. It seems that with the -O1 switch, g++ will change the execution order of my code. How can I forbid g++'s optimization of certain parts of my code while maintaining optimization elsewhere? I know I can split it into two files and link them, but that looks ugly.
If you find that gcc changes the order of execution in your code, you should consider using a memory barrier. Just don't assume that volatile variables will protect you from that issue. They only make sure that, within a single thread, the behavior is what the language guarantees, and that variables are always read from their memory locations, to account for changes "invisible" to the executing code (e.g. changes to a variable made by a signal handler).
GCC has supported OpenMP since version 4.2. You can use it to create a memory barrier with a special #pragma directive.
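Concretely, the usual options look like this (a sketch; the macro name is mine, and the OpenMP directive only takes effect when compiling with -fopenmp):

/* Compiler-only barrier: forbids GCC from moving memory accesses
   across this point, but emits no CPU instruction. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

void example(void) {
    COMPILER_BARRIER();

    /* Full barrier: compiler barrier plus a hardware memory fence. */
    __sync_synchronize();

    /* OpenMP's flush directive also acts as a memory barrier
       (requires -fopenmp). */
    #pragma omp flush
}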
A very good insight into lock-free code is this PDF by Herb Sutter and Andrei Alexandrescu: C++ and the Perils of Double-Checked Locking
You can use the function attribute __attribute__((optimize("O0"))) to set the optimization level for a single function, or #pragma GCC optimize for a block of code. These require GCC 4.4 or later, I think - check your GCC manual. If they aren't supported, separating the source is your only option.
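For example (a sketch; the function names are placeholders):

/* Compile just this function at -O0, regardless of the global flags
   (GCC 4.4+). */
__attribute__((optimize("O0")))
void touchy_init(void) {
    /* code whose ordering you don't want the optimizer to touch */
}

/* Or scope it to a region of the file: */
#pragma GCC push_options
#pragma GCC optimize ("O0")
void touchy_update(void) {
    /* also compiled at -O0 */
}
#pragma GCC pop_options

Note, though, that disabling optimization only hides ordering problems from the compiler; the point below about the processor still applies.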
I would also say, though, that if your code fails with optimization turned on, it is most likely that your code is just wrong, especially as you're trying to do something that is fundamentally very difficult. The processor may also reorder your code (within the limits of its memory model), so any reordering you're getting from GCC could potentially occur anyway.