JVM step by step simulator - jvm

Is there a free JVM implementation that allow to see the content of the different parts of the Java Virtual Machine (e.g., callstack, heap) and execute a program step by step?

Once the JIT compiles the bytecode to native code, the VM registers and stack have little meaning.
I would use your debugger to see what the Java program is doing line by line. The bytecode is for a virtual machine, not an actual one and the JVM doesn't have to follow the VM literally, only what the program does.
The JIT can
use the many registers your CPU has rather than use a pure stack.
inline code rather than perform method calls.
remove code which it determines isn't used.
place objects on the stack.
not synchronize objects which are only used in a local method.
A good tool to see how the code is translated from byte code to machine code is JITWatch

Related

C++ | Adding workload to a existing thread from a injected DLL

in my project i injected a DLL(64-bit Windows 10) in to a external process with Manual-map & Thread-hijacking and i do some stuff in there.
In current state i use "RtlCreateUserThread" to create a new thread and do some extra workload in there to distribute it for better performance.
My question is now... Is it possible to access other threads from the current process (hijack it) and add your own workload/code there. Without creating a new thread?
I didn't found anything helpful yet in the internet and the code i used and modified for Thread-hijacking seems to only work for a DLL file. Because i am pretty new to C++ i am still learning i am already thankful for any help.
(If you want to see the source for injector Google GHInjector your find the library on github.)
It is possible, but so complicated and may not work in all cases.
You need to splice existing thread's machine codes, so you will need write access to code page memory.
Logic:
find thread id and thread handle, then suspend thread with SuspendThread WINAPI call
suspended thread can be in wait state or in system DLL call now, so you need to analyze current execution stack, backtrace it and find execution address from application space. You need API functions StackWalk, and PDB files in some cases. Also it depends on running architecture (x86, amd64, ...). Walk through stack until your EIP/RIP will not be in application memory address space
decode machine instruction (it will be 'call') and splice next instructions to your function call. You need to use __declspec(naked) declared function or ASM implemented one for execute your code and replaced instructions.
ResumeThread
This method may work only once because no guarantees that application code is executed in loop.

Why isn't all the java bytecode initially interpreted to machine code?

I read about Just-in-time compilation (JIT) and as I understood, there are two approaches for this – Interpreter and JIT, both of which interpreting the bytecode at runtime.
Why not just preparatively interprete all the bytecode to machine code, and only then start to run the process with no more need for interpreter?
Another reason for late JIT compiling has to do with optimization: At run-time the VM can detect more/other patterns it may optimize than the compiler could ever do at compile-time. JIT pre-compiling at startup will always have to be static, and the same could have been done by the compiler already, but through analysis of the actual run-time behaviour the VM may have more information on possible optimizations and may therefore produce better optimization results.
For example, the VM can detect that a single piece of code is actually run a million times at run-time and perform appropriate optimizations which the compiler may have no information about, not unlike the branch prediction that's done at runtime in modern CPUs.
More information can be found in the Wikipedia article on "Adaptive optimization".
Simple: Because it takes time to precompile everything to machine code. And users don't want to wait on the application to start. Remember, the precompilation would have to make a lot of optimizations which takes time.
The server version of JVM is more aggressive in precompiling and optimizing code upfront because code on the server side tends to be executed more often and for a longer period of time before the process is shutdown.
However, a solution (for .Net) is an application called NGen which make the precompilation upfront such that it isn't needed after that point. You only have to run that once.
Not all VM's include an interpreter. For instance Chrome and CLR (.Net) always compiles to machine code before running. However, they have multiple levels of optimizations to reduce the startup time.
I found link showing how runtime recompilation can optimize performance and save extra CPU cycles.
Inlining expansion: To decrease the cost of procedure calls.
Removing redundant loads: When 2 compiled code results in some duplicate code then it can be removed and further optimised by recompilation at run time.
Copy propagation
Eliminating dead code
Here is another link for the same explanation given above.

Don't Both JIT and non-JIT enabled Interpreters Ultimately Produce Machine Code

Ok, I have read several discussions regarding the differences between JIT and non-JIT enabled interpreters, and why JIT usually boosts performance.
However, my question is:
Ultimately, doesn't a non-JIT enabled interpreter have to turn bytecode (line by line) into machine/native code to be executed, just like a JIT compiler will do? I've seen posts and textbooks that say it does, and posts that say it does not. The latter argument is that the interpreter/JVM executes this bytecode directly with no interaction with machine/native code.
If non-JIT interpreters do turn each line into machine code, it seems that the primary benefits of JIT are...
The intelligence of caching either all (normal JIT) or frequently encountered (hotspot/adaptive optimization) parts of the bytecode so that the machine code compilation step is not needed every time.
Any optimization JIT compilers can perform in translating bytecode into machine code.
Is that accurate? There seems to be little difference (other than possible optimization, or JITting blocks vs line by line maybe) between the translation of bytecode to machine code via non-JIT and JIT enabled interpreters.
Thanks in advance.
A non-JIT interpreter doesn't convert bytecode to machine code. You can imagine the workings of a non-JIT bytecode interpreter something like this (I'll use a Java-like pseudocode):
int[] bytecodes = { ... };
int ip = 0; // instruction pointer
while(true) {
int code = bytecodes[ip];
switch(code) {
case 0;
// do something
ip += 1; break;
case 1:
// do something else
ip += 1; break;
// and so on...
}
}
So for every bytecode executed, the interpreter has to retrieve the code, switch on its value to decide what to do, and increment its "instruction pointer" before going to the next iteration.
With a JIT, all that overhead would be reduced to nothing. It would just take the contents of the appropriate switch branches (the parts that say "// do something"), string them together in memory, and execute a jump to the beginning of the first one. No software "instruction pointer" would be required -- only the CPU's hardware instruction pointer. No retrieving of bytecodes from memory and switching on their values either.
Writing a virtual machine is not difficult (if it doesn't have to be extremely high performance), and can be an interesting exercise. I did one once for an embedded project where the program code had to be very compact.
Decades ago, there seemed to be a widespread belief that compilers would turn an entire program into machine code, while interpreters would translate a statement into machine code, execute it, discard it, translate the next one, etc. That notion was 99% incorrect, but there were two a tiny kernels of truth to it. On some microprocessors, some instructions required the use of addresses that were specified in code. For example, on the 8080, there was an instruction to read or write a specified I/O address 0x00-0xFF, but there was no instruction to read-or write an I/O address specified in a register. It was common for language interpreters, if user code did something like "out 123,45", to store into three bytes of memory the instructions "out 7Bh/ret", load the accumulator with 2Dh, and make a call to the first of those instructions. In that situation, the interpreter would indeed be producing a machine code instruction to execute the interpreted instruction. Such code generation, however, was mostly limited to things like IN and OUT instructions.
Many common Microsoft BASIC interpreters for the 6502 (and perhaps the 8080 as well) made somewhat more extensive use of code stored in RAM, but the code that was stored in RAM not not significantly depend upon the program that was executing; the majority of the RAM routine would not change during program execution, but the address of the next instruction was kept in-line as part of the routine allowing the use of an absolute-mode "LDA" instruction, which saved at least one cycle off every byte fetch.

Simulating multiple instances of an embedded processor

I'm working on a project which will entail multiple devices, each with an embedded (ARM) processor, communicating. One development approach which I have found useful in the past with projects that only entailed a single embedded processor was develop the code using Visual Studio, divided into three portions:
Main application code (in unmanaged C/C++ [see note])
I/O-simulating code (C/C++) that runs under Visual Studio
Embedded I/O code (C), which Visual Studio is instructed not to build, runs on the target system. Previously this code was for the PIC; for most future projects I'm migrating to the ARM.
Feeding the embedded compiler/linker the code from parts 1 and 3 yields a hex file that can run on the target system. Running parts 1 and 2 together yields code which can run on the PC, with the benefit of better debugging tools and more precise control over I/O behavior (e.g. I can make the simulation code introduce certain types of random hiccups more easily than I can induce controlled hiccups on real hardware).
Target code is written in C, but the simulation environment uses C++ so as to simulate I/O registers. For example, I have a PortArray data structure; the header file for the embedded compiler includes a line like unsigned char LATA # 0xF89; and my header file for simulation includes #define LATA _IOBIT(f89,1) which in turn invokes a macro that accesses a suitable property of an I/O object, so a statement like LATA |= 4; will read the simulated latch, "or" the read value with 4, and write the new value. To make this work, the target code has to compile under C++ as well as under C, but this mostly isn't a problem. The biggest annoyance is probably with enum types (which behave as integers in C, but have to be coaxed to do so in C++).
Previously, I've used two approaches to making the simulation interactive:
Compile and link a DLL with target-application and simulation code, and have VB code in the same project which interacts with it.
Compile the target-application code and some simulation code to an EXE with instance of Visual Studio, and use a second instance of Visual Studio for the simulation-UI. Have the two programs communicate via TCP, so nearly all "real" I/O logic is in the simulation program. For example, the aforementioned `LATA |= 4;` would send a "read port 0xF89" command to the TCP port, get the response, process the received value, and send a "write port 0xF89" command with the result.
I've found the latter approach to run a tiny bit slower than the former in some cases, but it seems much more convenient for debugging, since I can suspend execution of the unmanaged simulation code while the simulation UI remains responsive. Indeed, for simulating a single target device at a time, I think the latter approach works extremely well. My question is how I should best go about simulating a plurality of target devices (e.g. 16 of them).
The difficulty I have is figuring out how to make each simulated instance get its own set of global variables. If I were to compile to an EXE and run one instance of the EXE for each simulated target device, that would work, but I don't know any practical way to maintain debugger support while doing that. Another approach would be to arrange the target code so that everything would compile as one module joined together via #include. For simulation purposes, everything could then be wrapped into a single C++ class, with global variables turning into class-instance variables. That would be a bit more object-oriented, but I really don't like the idea of forcing all the application code to live in one compiled and linked module.
What would perhaps be ideal would be if the code could load multiple instances of the DLL, each with its own set of global variables. I have no idea how to do that, however, nor do I know how to make things interact with the debugger. I don't think it's really necessary that all simulated target devices actually execute code simultaneously; it would be perfectly acceptable for simulation instances to use cooperative multitasking. If there were some way of finding out what range of memory holds the global variables, it might be possible to have the 'task-switch' method swap out all of the global variables used by the previously-running instance and swap in the contents applicable to the instance being switched in. Although I'd know how to do that in an embedded context, though, I'd have no idea how to do that on the PC.
Edit
My questions would be:
Is there any nicer way to allow simulation logic to be paused and examined in VS2010 debugger, while keeping a responsive UI for the simulator front-end, than running the simulator front end and the simulator logic in separate instances of VS2010, if the simulation logic must be written in C and the simulation front end in managed code? For example, is there a way to tell the debugger that when a breakpoint is hit, some or all other threads should be allowed to keep running while the thread that had hit the breakpoint sits paused?
If the bulk of the simulation logic must be source-code compatible with an embedded system written in C (so that the same source files can be compiled and run for simulation purposes under VS2010, and then compiled by the embedded-systems compiler for use in real hardware), is there any way to have the VS2010 debugger interact with multiple simulated instances of the embedded device? Assume performance is not likely to be an issue, but the number of instances will be large enough that creating a separate project for each instance would be likely be annoying in the absence of any way to automate the process. I can think of three somewhat-workable approaches, but don't know how to make any of them work really nicely. There's also an approach which would be better if it's possible, but I don't know how to make it work.
Wrap all the simulation code within a single C++ class, such that what would be global variables in the target system become class members. I'm leaning toward this approach, but it would seem to require everything to be compiled as a single module, which would annoyingly affect the design of the target system code. Is there any nice way to have code access class instance members as though they were globals, without requiring all functions using such instances to be members of the same module?
Compile a separate DLL for each simulated instance (so that e.g. if I want to run up to 16 instances, I would include 16 DLL's in the project, all sharing the same source files). This could work, but every change to the project configuration would have to be repeated 16 times. Really ugly.
Compile the simulation logic to an EXE, and run an appropriate number of instances of that EXE. This could work, but I don't know of any convenient way to do things like set a breakpoint common to all instances. Is it possible to have multiple running instances of an EXE attached to a single debugger instance?
Load multiple instances of a DLL in such a way that each instance gets its own global variables, while still being accessible in the debugger. This would be nicest if it were possible, but I don't know any way to do so. Is it possible? How? I've never used AppDomains, but my intuition would suggest that might be useful here.
If I use one VS2010 instance for the front-end, and another for the simulation logic, is there any way to arrange things so that starting code in one will automatically launch the code in the other?
I'm not particularly committed to any single simulation approach; while it might be nice to know if there's some way of slightly improving the above, I'd also like to know of any other alternative approaches that could work even better.
I would think that you'd still have to run 16 copies of your main application code, but that your TCP-based I/O simulator could keep a different set of registers/state for each TCP connection that comes in.
Instead of a bunch of global variables, put them into a single structure that encompasses the I/O state of a single device. Either spawn off a new thread for each socket, or just keep a list of active sockets and dedicate a single instance of the state structure for each socket.
the simulators I have seen that handle multiple instances of the instruction set/processor are designed that way. There is a structure usually that contains a complete set of registers, and a new pointer or an array of these structures are used to multiply them into multiple instances of the processor.

how to debug SIGSEGV in jvm GCTaskThread

My application is experiencing cashes in production.
The crash dump indicates a SIGSEGV has occurred in GCTaskThread
It uses JNI, so there might be some source for memory corruption, although I can't be sure.
How can I debug this problem - I though of doing -XX:OnError... but i am not sure what will help me debug this.
Also, can some of you give a concrete example on how JNI code can crash GC with SIGSEGV
EDIT:
OS:SUSE Linux Enterprise Server 10 (x86_64)
vm_info: Java HotSpot(TM) 64-Bit Server VM (11.0-b15) for linux-amd64 JRE (1.6.0_10-b33), built on Sep 26 2008 01:10:29 by "java_re" with gcc 3.2.2 (SuSE Linux)
EDIT:
The issue stop occurring after we disable the hyper threading, any thoughts?
Errors in JNI code can occur in several ways:
The program crashes during execution of a native method (most common).
The program crashes some time after returning from the native method, often during GC (not so common).
Bad JNI code causes deadlocks shortly after returning from a native method (occasional).
If you think that you have a problem with the interaction between user-written native code and the JVM (that is, a JNI problem), you can run diagnostics that help you check the JNI transitions. to invoke these diagnostics; specify the -Xcheck:jni option when you start up the JVM.
The -Xcheck:jni option activates a set of wrapper functions around the JNI functions. The wrapper functions perform checks on the incoming parameters. These checks include:
Whether the call and the call that initialized JNI are on the same thread.
Whether the object parameters are valid objects.
Whether local or global references refer to valid objects.
Whether the type of a field matches the Get<Type>Field or Set<Type>Field call.
Whether static and nonstatic field IDs are valid.
Whether strings are valid and non-null.
Whether array elements are non-null.
The types on array elements.
Pls read the following links
http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.50/html/jni_debug.html
http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtq
Use valgrind. This sounds like a memory corruption. The output will be verbose but try to isolate the report to the JNI library if its possible.
Since the faulty thread seems to be GCTaskThread, did you try enabling verbose:gc and analyzing the output (preferably using a graphical tool like samurai, etc.)? Are you able to isolate a specific lib after examining the hs_err file?
Also, can you please provide more information on what causes the issue and if it is easily reproducible?