As far as I know, a JVM can work in different ways:
Interpreter: Bytecode is translated/executed at runtime, over and over again, without keeping any compiled result.
Just-in-time (JIT) compilation: Parts of the bytecode are compiled into native code at runtime when needed, and the compiled code is kept. There is a performance overhead/penalty for the compilation itself, but it opens up possibilities for adaptive optimization at runtime which are not possible with static ahead-of-time compilation.
HotSpot: Only frequently executed parts get JIT-compiled; the rest is interpreted.
Now GraalVM can do ahead-of-time (AOT) compilation of bytecode into native code.
Is it possible to compile bytecode ahead of time and still do adaptive optimization on hotspots (in general, and with GraalVM in particular)?
[Clarification]
I don't mean AOT-compiling parts of the bytecode to native code and leaving other parts as bytecode so that hotspots among them can be JIT-compiled at runtime. This is what the Excelsior JET Java implementation seems to do, from what I have read so far.
I mean AOT-compiling the whole bytecode and replacing the hot parts at runtime with adaptively optimized recompilations of those hotspots, which requires hooking/inserting the optimized code properly into the existing AOT-compiled code.
[/clarification]
I don't know what information is needed to recompile hotspots with adaptive optimization at runtime. Is the bytecode needed to do it? That would mean higher memory consumption as the cost of higher performance.
I am not an expert in this, so please tell me if any of my assumptions are wrong.
Refer to JEP 295. It mentions different AOT modes, including tiered AOT, which delivers C1-compiled code with profiling instrumentation that can then be recompiled with C2 at runtime.
Related
I'm wondering why a GraalVM (SubstrateVM) native image of a Java application consumes much less memory at runtime than the same application run normally on the JVM.
And why can't the normal JIT be made to similarly consume a small amount of memory?
GraalVM native images don't include the JIT compiler or the related infrastructure -- so there's no need to allocate memory for the JIT, or for the internal representation of the program being compiled (for example a control-flow graph), no need to store some of the class metadata, etc.
So it's unlikely that a JIT which actually does useful work can be implemented with the same zero overhead.
It could be possible to create an economical implementation of the virtual machine that perhaps uses less memory than HotSpot, especially if you only measure the default configuration and don't compare setups where you control how much memory the JVM is allowed to use. However, one needs to realize that it would either be an incremental improvement on the existing implementations or a different trade-off, because the existing JVM implementations are actually really, really good.
Is customizing the JIT a viable alternative to some of the JNI use cases?
Edit:
This is for the purpose of improving performance via custom hardware such as a GPU or FPGA. I am aware of Project Sumatra, but I believe it is GPU-focused and no longer active. Optimization via JNI requires explicit specification in the source, while something like an intrinsic in the JIT (maybe a more custom and complex version of an intrinsic) could make the optimization a bit more automatic?
I have one relatively complicated shader, which I want to compile.
The shader has ~700 lines, which compile into ~3000 instructions.
Compilation time with fxc (Windows 8 SDK) is about 90 seconds.
I have another shader of similar size whose compilation time is 20 seconds.
So here are my questions:
Is it possible to speed up the compilation from the application viewpoint (a faster version of fxc, or an fxc alternative)?
Is it possible to speed up the compilation from the code viewpoint (are there code constructs which massively slow down compilation; which ones, and how can they be avoided)?
Is it possible to speed up the compilation from the fxc settings viewpoint (some secret option like --fast-compile or whatever)?
Edit:
Parallel thread on the MSDN forum:
https://social.msdn.microsoft.com/Forums/en-US/5e60c68e-8902-48d6-b497-e86ac4f2cfe7/hlsl-compilation-speed?forum=vclanguage
There is no "faster" fxc or d3dcompile library.
You can do different things to speed things up; turning off optimisation is one of them, as the driver will optimise anyway when going from DXBC to the final microcode.
But the best advice is to implement a shader cache: if you, for example, pre-process and hash the shader file and trigger compilation only if it has actually changed, you will save time.
The d3dcompile library is thread safe, and you want to take advantage of multi-core CPUs. Implementing the include interface to cache file loads can be valuable too if you compile many shaders.
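To make the cache suggestion concrete, here is a minimal sketch; the structure and names are my own, not from any particular engine. A real implementation would hash the preprocessed source (so header changes are caught), use a stable content hash such as SHA-1 instead of std::hash, and persist the blobs to disk between runs.

    // Sketch of a hash-keyed shader cache around D3DCompile (illustrative only).
    #include <d3dcompiler.h>   // link with d3dcompiler.lib
    #include <functional>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct ShaderCache {
        std::unordered_map<size_t, std::vector<char>> blobs;   // hash -> DXBC bytecode

        // Returns compiled bytecode, invoking D3DCompile only on a cache miss.
        const std::vector<char>& get(const std::string& source,
                                     const char* entryPoint,
                                     const char* target,   // e.g. "ps_5_0"
                                     UINT flags)           // e.g. D3DCOMPILE_SKIP_OPTIMIZATION (the /Od equivalent)
        {
            const size_t key =
                std::hash<std::string>{}(source + '|' + entryPoint + '|' + target);
            auto it = blobs.find(key);
            if (it != blobs.end())
                return it->second;                          // hit: no compilation cost at all

            ID3DBlob* code = nullptr;
            ID3DBlob* errors = nullptr;
            HRESULT hr = D3DCompile(source.data(), source.size(), nullptr,
                                    nullptr, nullptr,       // no macros / include handler in this sketch
                                    entryPoint, target, flags, 0, &code, &errors);
            if (errors) errors->Release();
            if (FAILED(hr))
                throw std::runtime_error("shader compilation failed");

            std::vector<char> bytes(
                static_cast<char*>(code->GetBufferPointer()),
                static_cast<char*>(code->GetBufferPointer()) + code->GetBufferSize());
            code->Release();
            return blobs.emplace(key, std::move(bytes)).first->second;
        }
    };

Since D3DCompile is thread safe, independent cache misses can also be dispatched to a thread pool so all cores are used.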
Finally, when everything fails, you have no choice but to experiment and find what takes that long, and do some rewriting; sometimes a [branch] or [unroll] on the culprit may be enough to fix the compilation time.
Why is the shader compilation time a problem? Fxc is an offline compiler, meaning that the resulting bytecode is hardware independent and can be distributed with your application.
If you're looking to cut down on iteration times during development, disabling optimizations with the "/Od" command-line option should help.
I have some questions about Just-In-Time (JIT) compilation with CUDA.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition
It seems to work fairly well. If I compare the computing time of the matrix elementwise operation
D_D=A_D*B_D-sin(C_D)+3.;
with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):
time [ms] hand-written kernel: 2.05 (1024x1024), 8.16 (2048x2048), 57.4 (4096x4096)
time [ms] library: 2.07 (1024x1024), 8.17 (2048x2048), 57.4 (4096x4096)
The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, following Expression templates: improving performance in evaluating expressions?. My first question is:
1. Which kind of further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any further burdening due to runtime compilation?
It is known that a library based on Expression Templates cannot be packaged inside a .dll; see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation to a third-party user? If yes, how?
The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.
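For context, here is a minimal, host-side sketch of the lazy expression-template pattern described above; all type and function names are illustrative and not taken from my actual library, which evaluates the expression inside a CUDA kernel and uses lightweight device-pointer views as leaves rather than the std::vector copies used here.

    // Minimal lazy expression-template sketch (illustrative names only).
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct Vec {                                  // stand-in for a device array
        std::vector<double> data;
        double operator[](std::size_t i) const { return data[i]; }
        std::size_t size() const { return data.size(); }
    };

    // Binary expression node; operands are stored by value so that
    // "auto expr = a * b - c;" remains valid after the temporaries die.
    template <typename L, typename R, typename Op>
    struct Binary {
        L lhs; R rhs; Op op;
        double operator[](std::size_t i) const { return op(lhs[i], rhs[i]); }
        std::size_t size() const { return lhs.size(); }
    };

    struct MulOp { double operator()(double a, double b) const { return a * b; } };
    struct SubOp { double operator()(double a, double b) const { return a - b; } };

    template <typename L, typename R>
    Binary<L, R, MulOp> operator*(const L& a, const R& b) { return {a, b, MulOp{}}; }
    template <typename L, typename R>
    Binary<L, R, SubOp> operator-(const L& a, const R& b) { return {a, b, SubOp{}}; }

    int main() {
        Vec a{{1.0, 2.0, 3.0}}, b{{4.0, 5.0, 6.0}}, c{{7.0, 8.0, 9.0}};
        auto expr = a * b - c;                    // builds an expression tree; no work yet
        for (std::size_t i = 0; i < expr.size(); ++i)
            std::cout << expr[i] << ' ';          // evaluated element-wise on demand: -3 2 9
        std::cout << '\n';
    }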
EDIT following Talonmies' comment
From the post Cuda kernel just-in-time (jit) compilation possible?, it appears that
cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime
A doubt I have is whether the above can be applied to an expression-templates library. I know that, due to instantiation issues, CUDA/C++ template code cannot be compiled to PTX as-is. But perhaps, if I instantiate all the possible combinations of types/operators for unary and binary expressions, at least part of the implementation could be compiled to PTX (and thus hidden from third-party users), which could in turn be JIT-compiled for the architecture at hand.
I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).
I take a similar, expression-template-based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (which is a dialect of C).
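To illustrate the run-time model referred to above, here is plain OpenCL host code (not Boost.Compute itself): the kernel source is just a string that gets compiled and launched at run time, which is what lets a library assemble kernels from C++ expressions. The kernel is a placeholder, and error handling, data initialization, and cleanup are omitted.

    // Minimal sketch of OpenCL's run-time ("online") kernel compilation.
    #include <CL/cl.h>
    #include <cstdio>

    static const char* kSource = R"CLC(
    __kernel void saxpy(__global const float* x, __global float* y, float a) {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }
    )CLC";

    int main() {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

        // The "JIT" step: source text -> device code, entirely at run time.
        cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
        clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel kernel = clCreateKernel(program, "saxpy", nullptr);

        const size_t n = 1024;
        cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), nullptr, nullptr);
        cl_mem y = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);
        float a = 2.0f;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &x);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &y);
        clSetKernelArg(kernel, 2, sizeof(float), &a);

        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(queue);
        printf("kernel compiled and launched at run time\n");
        return 0;
    }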
VexCL started as an expression-template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA sources: the nvcc compiler is called behind the scenes, the compiled PTX is stored in an offline cache, and it is loaded on subsequent launches of the program. See the CUDA backend sources for how to do this; compiler.hpp should probably be of most interest to you.
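Regarding question 3, here is a minimal sketch of loading PTX at runtime through the CUDA driver API; it is not taken from VexCL, it only shows the underlying calls. The PTX file, kernel name, and argument list are placeholders, and error checking is omitted. The PTX itself could come from an offline nvcc run or be generated on the fly.

    // Load a PTX module at run time and JIT it for the current device.
    #include <cuda.h>        // CUDA driver API, link with -lcuda
    #include <cstdio>
    #include <fstream>
    #include <sstream>
    #include <string>

    int main() {
        // Read a PTX module from disk; it could equally be a string generated a moment ago.
        std::ifstream in("expression_kernels.ptx");   // placeholder file name
        std::stringstream ss; ss << in.rdbuf();
        std::string ptx = ss.str();

        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

        // This is where the PTX -> machine code JIT for the current GPU happens.
        CUmodule module;
        cuModuleLoadDataEx(&module, ptx.c_str(), 0, nullptr, nullptr);

        // Look the kernel up by its (possibly mangled) name and launch it.
        CUfunction kernel;
        cuModuleGetFunction(&kernel, module, "evaluate_expression");  // placeholder name

        CUdeviceptr d_out;
        size_t n = 1 << 20;
        cuMemAlloc(&d_out, n * sizeof(float));
        void* args[] = { &d_out, &n };                // assumes kernel signature (float*, size_t)
        cuLaunchKernel(kernel, (unsigned)((n + 255) / 256), 1, 1,   // grid
                       256, 1, 1,                                   // block
                       0, 0, args, nullptr);
        cuCtxSynchronize();

        cuMemFree(d_out);
        cuModuleUnload(module);
        cuCtxDestroy(ctx);
        printf("PTX module JIT-compiled and launched at run time\n");
        return 0;
    }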
I was wondering if someone has had experience with the llvm/tools - lli interpreter/JIT-compiler (cf. http://llvm.org/docs/GettingStarted.html#tools). I am interested in any information that you can provide (speed, complexity, implementations, etc.).
Thanks.
UPDATE:
Okay, how would bitcode execution compare to LuaJIT VM execution, supposing that lli acts as an interpreter? What about when lli acts as a JIT compiler (same comparison)?
NOTE:
I am only asking in case anyone has experience and is willing to spare some time to share it.
LuaJIT is a tracing JIT, which means it can re-optimize itself to better suit the data passed through the execution environment; LLVM, however, is a static JIT, and thus will just generate once-off best-case machine code for the corresponding source, which may lead to it losing performance in tight loops or on badly mispredicted branches.
The actual LuaJIT VM is also highly optimized, threaded, machine-specific assembly, whereas LLVM uses C++ for portability (and other reasons), so this obviously gives LuaJIT a huge advantage. LLVM also has a much higher overhead than LuaJIT, purely because LuaJIT was designed to work on much less powerful systems (such as those driven by ARM CPUs).
The LuaJIT bytecode was also specially designed for LuaJIT, whereas LLVM's bitcode is a lot more generic; this will obviously make LuaJIT's execution faster. LuaJIT's bytecode is also well designed for encoding optimization hints, etc., for use by the JIT and the tracer.
Ignoring the fact that these are two different types of JITs, the whole comparison boils down to this: LLVM is focused on being a generic JIT/compiler backend, while LuaJIT is focused on executing Lua as fast as possible in the best way possible, and thus it gains from not being constrained by generality.