TLDR version:
Is there a way to cache a stateless LESS compilation result per LESS file? We have large LESS utils (e.g. from libraries) that are reused by many pages, but compiling them (in particular, the regex-based parsing) seems rather slow. less.Parser does produce an abstract syntax tree as its result, but that result doesn't appear to be stateless (see e.g. less.Ruleset.evalImports), so it doesn't seem suitable for caching for this purpose.
LONG VERSION:
I'm looking for some pointers on optimizing LESS server-side preprocessing at scale. We have an application built from many modules, each of which contributes some styles in the form of CSS, as well as modules that provide common LESS utilities like mixins and various variables. The application also exposes many endpoints, each of which pulls in a mixture of these dependencies.
For performance, we want to precompile and cache the styles for each endpoint, but this collides with what seems to be LESS's fundamental design. In particular, LESS compilation feels stateful, in the sense that the same @import in different locations in different files produces different results (from a compiler-output point of view). This makes compiler caching hard, since there's no real way to ensure that each LESS file is parsed once and done forever (i.e. each endpoint needs to reparse a LESS utils file for use in its own styles, and each module that wants to define its own styles needs to do the same).
Possible approaches considered so far:
I've looked a little into caching the AST for each file during compilation (i.e. the output of less.Parser), but it seems futile, since the AST itself appears to be stateful (see e.g. less.Ruleset.evalImports). This is unfortunate, since the parsing stage seems to be by far the slowest part of LESS compilation (it can add about 20 minutes to an ~2 minute server startup time with precompilation on).
I've also looked into injecting something other than an @import (reference) or the equivalent when compiling scripts. In particular, if I could convert the imports into the explicit state (the set of currently defined mixins/variables/etc.) from a previous compilation, I could try to inject that state manually into the parsing process. Unfortunately, this feels like it involves very deep hacking of the less.js code, which I'd like to avoid.
Compile in the context of an endpoint (i.e. perform one giant LESS compilation over an endpoint's style dependencies). This bounds compilation time to (size of utils) * (# of endpoints), but doesn't help as much as it could when the number of endpoints is also very large. It also puts all the compilation into a single context, allowing unexpected cross-module state interactions produced by subtle ordering changes.
Topologically sort everything and compile it all in one LESS compilation. This ensures each file is traversed only once, but craps all over the global namespace.
In essence, compilation time feels bounded by (size of LESS utils) * (# of LESS compilation events). Minimizing statefulness pushes toward more compilation events, which explodes compilation time; minimizing compilation events improves performance, but sacrifices state independence.
Caching a stateless compilation result (e.g. like a precompiled header file for gcc) would break the performance barrier and achieve the optimum result, but I'm not sure how this can be done in current less.js code. Does anyone have any pointers/info in this regard?
Related
I'm working on a very demanding project (actually an interpreter), written exclusively in D, and I'm wondering what kinds of optimizations would generally be recommended. The project makes heavy use of the GC, classes, associative arrays, and pretty much everything else.
Regarding compilation, I've already experimented with both DMD and LDC flags, and LDC with -flto=full -O3 -Os -boundscheck=off seems to make a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to mind that could help speed things up, whether related to the D language or not. (I'm sure I'm missing several things.)
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in the comments. Personally, under Linux I have a script that dumps a process's stack trace (gdb's thread apply all bt can do this), and I run it a few times to get an idea of where the program is getting hung up.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirection and are not inlineable. Make sure only those methods that must be virtual are (D's final keyword helps here). See if you can rewrite your program using a form of polymorphism that doesn't involve indirection, such as template metaprogramming (see the sketch after this list).
Since you mentioned associative arrays: the built-in ones make heavy use of the GC. To speed them up, switch to a third-party library that works on top of std.experimental.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing interpreters, up to and including JIT/AOT compilation. Perhaps you could link against an existing library such as LLVM or libjit.
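To make the virtual-dispatch point concrete, here is a minimal sketch of polymorphism without indirection via templates. It is written in C++ purely for illustration (D templates express the same pattern), and all names are invented:

    #include <cstdio>

    // Dynamic dispatch for contrast: eval() goes through the vtable on
    // every call and generally cannot be inlined.
    struct NodeBase {
        virtual long eval() const = 0;
        virtual ~NodeBase() {}
    };

    // Static dispatch: the concrete node type is a template parameter,
    // so eval() is a direct call the compiler can inline into the loop.
    template <typename Node>
    long run(const Node& n) {
        long total = 0;
        for (int i = 0; i < 1000000; ++i)
            total += n.eval();  // direct, inlineable call
        return total;
    }

    struct Literal {
        long value;
        long eval() const { return value; }  // non-virtual
    };

    int main() {
        Literal lit{42};
        std::printf("%ld\n", run(lit));  // run() instantiated for Literal
    }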
While looking for ways to speed up my simulation, I came across the --force-lto option.
I've heard about LTO (Link-Time Optimization) before, so that made me wonder: why isn't --force-lto the default when building gem5?
Would it make a simulation go much faster, in the way a gem5.fast build is faster than a gem5.opt build?
As of gem5 fe15312aae8007967812350f8cdac9ad766dcff7 (2019), the gem5.fast build already enables LTO by default, so you generally never want to use that option explicitly; you just want gem5.fast.
Other things to keep in mind about .fast:
it also removes -g, so you get no debug symbols. I wonder why, since removing them does not make runs any faster.
it also turns on NDEBUG, which has the standard effect of disabling assert() entirely, plus some gem5-specific effects spread throughout the code behind #ifndef NDEBUG checks
it disables TRACING_ON, which turns DPRINTF and family into empty statements, as can be seen in src/base/trace.hh
Those effects can be seen easily at src/SConstruct.
That option exists because the more common gem5.opt build also uses partial linking, which in some versions of GCC was incompatible with LTO.
Therefore, as its name suggests, --force-lto forces the use of LTO together with partial linking, which might not be stable. That's why I recommend using gem5.fast rather than touching --force-lto.
The goal of partial linking is presumably to speed up the link step, which can easily be the bottleneck in a "change one file, rebuild, relink, test" loop, although in my experiments it is not clear that it actually achieves that. Today it might just be a relic of the past.
To speed up linking, I recommend trying scons --gold-linker instead, which uses the gold linker rather than ld. Note, however, that this option was more noticeably effective for gem5.debug.
I have found that gem5.fast is generally 20% faster than gem5.opt for Atomic CPUs.
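For reference, typical invocations of the variants discussed here look like this (assuming an X86 target; substitute your ISA directory):

    scons build/X86/gem5.opt -j8                 # day-to-day build: asserts and DPRINTF on
    scons build/X86/gem5.fast -j8                # benchmarking build: LTO, NDEBUG, tracing off
    scons build/X86/gem5.opt --gold-linker -j8   # faster relinking during development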
I've read about just-in-time compilation (JIT), and as I understand it, there are two approaches to running bytecode: an interpreter and a JIT, both of which process the bytecode at runtime.
Why not just translate all the bytecode to machine code ahead of time, and only then start the process, with no more need for an interpreter?
Another reason for late JIT compilation has to do with optimization: at run-time the VM can detect more (or other) patterns to optimize than a compiler ever could at compile-time. JIT pre-compiling at startup would necessarily be static, and the compiler could already have done the same; by analyzing the actual run-time behaviour, however, the VM has more information about possible optimizations and can therefore produce better results.
For example, the VM can detect that a single piece of code actually runs a million times and optimize it accordingly, which the compiler may have no information about, not unlike the branch prediction done at runtime in modern CPUs.
More information can be found in the Wikipedia article on "Adaptive optimization".
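To make this concrete, below is a minimal sketch of the call-counting ("hotness detection") mechanism such VMs use. The two-instruction bytecode and the threshold are invented; this is the shape of the technique, not any real VM's code:

    #include <cstdio>
    #include <vector>

    // Invented bytecode, just to give the interpreter something to run.
    enum class Op { Inc, Halt };

    struct Function {
        std::vector<Op> code;
        long callCount = 0;
        bool jitCompiled = false;
    };

    const long kHotThreshold = 10000;  // made-up value; real VMs tune this

    void interpret(Function& fn, long& acc) {
        // Count invocations; past the threshold a real VM would compile
        // the function using the profile gathered so far (observed types,
        // taken branches, call counts) and dispatch to machine code.
        if (++fn.callCount == kHotThreshold && !fn.jitCompiled) {
            fn.jitCompiled = true;
            std::puts("function became hot -> hand it to the JIT compiler");
        }
        for (Op op : fn.code) {
            if (op == Op::Inc) ++acc;
            else break;  // Op::Halt
        }
    }

    int main() {
        Function f{{Op::Inc, Op::Halt}};
        long acc = 0;
        for (int i = 0; i < 1000000; ++i)
            interpret(f, acc);
        std::printf("acc = %ld after %ld calls\n", acc, f.callCount);
    }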
Simple: because it takes time to precompile everything to machine code, and users don't want to wait for the application to start. Remember, the precompilation would have to perform a lot of optimizations, which takes time.
The server version of the JVM is more aggressive about precompiling and optimizing code up front, because server-side code tends to be executed more often and for a longer period before the process is shut down.
However, one solution (for .NET) is an application called NGen, which performs the precompilation up front so that it isn't needed after that point. You only have to run it once.
Not all VMs include an interpreter. For instance, Chrome and the CLR (.NET) always compile to machine code before running. However, they have multiple levels of optimization to reduce the startup time.
I found a link showing how runtime recompilation can optimize performance and save extra CPU cycles:
Inline expansion: to decrease the cost of procedure calls.
Removing redundant loads: when two pieces of compiled code contain duplicate work, the duplication can be removed and further optimized by recompilation at run time.
Copy propagation
Eliminating dead code
Here is another link with the same explanation as above (see the small before/after example below).
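As a tiny, made-up illustration of three of those items (inlining, copy propagation, dead-code elimination), shown here as C++ source for readability even though a JIT performs this on compiled code:

    int scale(int x) { return x * 2; }

    // Before optimization:
    int before(int a) {
        int b = scale(a);    // procedure call
        int c = b;           // copy
        int unused = c * 7;  // result never used
        return c;
    }

    // After inlining scale(), propagating the copy c = b, and removing
    // the dead computation of `unused`:
    int after(int a) {
        return a * 2;
    }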
Is it easy to achieve a high level of optimization with LLVM?
To give a concrete example, let's assume that I have a simple language that I want to write a compiler for:
simple functions
simple structs
tables
pointers (with arithmetic)
control structures
etc.
I can quite easily create a compile-to-C backend and rely on clang -O3.
Is it as easy to use the LLVM API for that purpose?
Except perhaps for a few high-level optimizations (as in, ones aware of high-level language features or details that aren't encoded in LLVM IR), Clang's backend does little more than generate straightforward IR and run a set of LLVM optimization passes on it. All of these (or at least most) should be available through the opt command, and also as API calls when using the C++ libraries that all LLVM tools are built on. See the tutorial for a simple example. I see several advantages:
LLVM IR is far simpler than C, and there's already a convenient API for generating it programmatically (a short sketch follows this list). To generate C, you'd either have lots of ugly and unreliable string fiddling, or have to build an AST for the C language yourself. Or both.
You get to choose the set of optimizations yourself (it's quite possible that Clang's set of passes isn't ideal for the constructs your language supports and the IR your compiler generates). This also means you can, during development, just run the passes that check IR well-formedness, uncovering compiler bugs faster. You can simply copy Clang's pass order, but if you feel like it, you can also experiment.
It allows better compile times. Clang is fast for a C compiler, but you'd be adding unnecessary overhead: you generate C code, then Clang parses it, converts it to IR, and goes on to do pretty much what you could have done right away.
You may have access to a broader range of features, or at least get them more easily (i.e. without having to resort to #defines, obscure pragmas, intrinsics, or command-line options). I'm talking about things like vectors, guaranteed tail calls (well, more guaranteed than in C anyway; AFAIK some code generators ignore the hint), pure/readonly functions, and more control over memory layout and type conversions (for instance zero extension vs. sign extension). Granted, you may not need most of them.
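For a taste of the API, here is a hedged sketch in the spirit of the Kaleidoscope tutorial (exact headers and calls vary slightly across LLVM versions): it builds i32 add(i32, i32) directly in IR and prints the module.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main() {
        LLVMContext ctx;
        Module mod("demo", ctx);
        IRBuilder<> builder(ctx);

        // declare i32 @add(i32, i32)
        FunctionType* fnTy = FunctionType::get(
            builder.getInt32Ty(),
            {builder.getInt32Ty(), builder.getInt32Ty()},
            /*isVarArg=*/false);
        Function* fn =
            Function::Create(fnTy, Function::ExternalLinkage, "add", &mod);

        // entry: %sum = add i32 %a, %b ; ret i32 %sum
        BasicBlock* entry = BasicBlock::Create(ctx, "entry", fn);
        builder.SetInsertPoint(entry);
        Value* sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
        builder.CreateRet(sum);

        verifyFunction(*fn);  // the well-formedness check mentioned above
        mod.print(outs(), nullptr);
        return 0;
    }

From there, running optimization passes over the module (via opt or the pass-manager APIs) gives you an -O3-style pipeline without ever round-tripping through C.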
LLVM has built-in optimization passes, so you can achieve -O3-like optimization through the API.
The .less library calls itself a port of the Ruby LESS library. Can I take away from that that both are compilers for the same LESS file format, or do they expect subtly different LESS code? Asked another way: am I locking myself into the dotless library, or can I use both dotless and the less JavaScript lib on the same LESS files?
Dotlesscss is a straight (almost 1:1) port of the JavaScript project less.js (a JavaScript implementation of LessCss by cloudhead, the original author of LessCss for Ruby).
In 99% of cases, the same code that runs on dotlesscss will run on less.js and vice versa. If something works on less.js but doesn't on dotlesscss, we consider that a bug and try to fix it if possible.
There are very subtle differences though as it is very hard to keep three different projects 100% synced up.
One such difference is function names.
Examples would be the color manipulation functions, which we implemented before the LessCss project did and named after their SASS equivalents.
But in general, the language is 100% compatible.
You are not limiting yourself to one language. You should be able to move between different implementations fairly easily.
Also dotless runs on Mono so you are not locked to a specific OS either.
If you encounter any problems, feel free to raise an issue on our GitHub page or through the mailing list.
They're supposed to be equivalent implementations; however, there is a huge difference between:
the server-side implementations (Ruby, .NET, PHP, ...)
the client-side JavaScript implementation
The big difference is that with the client-side implementation you'll be able to use the whole DOM of the browser in your LESS files, and this will never work with server-side implementations:
@height: `document.body.clientHeight`;
Moreover, in the current version of dotless (1.2.4.0), JavaScript evaluation is not implemented and is rendered as [script unsupported] in the CSS output.