Recently I started using PITest for mutation testing. After building my project with Maven, when I run the command mvn org.pitest:pitest-maven:mutationCoverage I get this error a bunch of times:
stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined.
Sometimes the error is followed by
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
or PIT >> WARNING : Slave exited abnormally due to TIMED_OUT
I use OS X version 10.10.4 and Java 8 (jdk1.8.0_74).
Any fix/workaround for this?
Don't worry about this:
stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be used. Which one is undefined.
This is just informational: there are two implementations of JavaLaunchHelper, and the message tells you that one of the two will be used, but which one is undetermined. It is a known issue; see also this question.
The other two are a result of what PIT is doing: it modifies the byte code, and it may happen that this does not just affect the output of an operation (which a test would detect) but actually changes the runtime behavior, for example if the boundaries of a loop get changed so that the loop runs endlessly. PIT is capable of detecting this and prints out an error. Mutations detected by either a memory error or a timeout error can be considered "killed", but you should check each of them individually, as they could be false positives, too.
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
means the modified code produces more or larger objects, so the forked JVM runs out of memory. Imagine a loop like this:
while (a < b) {
    list.add(new Object());
    a++;
}
If the a++ gets mutated to a--, the loop may still end eventually (once a wraps around), but it's far more likely you run out of memory before that.
From the documentation
A memory error might occur as a result of a mutation that increases the amount of memory used by the system, or may be the result of the additional memory overhead required to repeatedly run your tests in the presence of mutations. If you see a large number of memory errors consider configuring more heap and permgen space for the tests.
The timeout issue is similar: the reason could be either that you actually run into an infinite loop, or that the system merely thinks you do, e.g. because it is too slow to execute the mutated code within the expected time. If you experience a lot of timeouts you should consider increasing the timeout value, but be careful, as this may impact the overall execution time.
From the FAQ
Timeouts when running mutation tests are caused by one of two things:
1 A mutation that causes an infinite loop
2 PIT thinking an infinite loop has occurred but being wrong
In order to detect infinite loops PIT measures the normal execution time of each test without any mutations present. When the test is run in the presence of a mutation PIT checks that the test doesn’t run for any longer than
normal time * x + y
Unfortunately the real world is more complex than this.
Test times can vary due to the order in which the tests are run. The first test in a class may have an execution time much higher than the others, as the JVM will need to load the classes required for that test. This can be particularly pronounced in code that uses XML binding frameworks such as JAXB, where classloading may take several seconds.
When PIT runs the tests against a mutation the order of the tests will be different. Tests that previously took milliseconds may now take seconds as they now carry the overhead of classloading. PIT may therefore incorrectly flag the mutation as causing an infinite loop.
A fix for this issue may be developed in a future version of PIT. In the meantime, if you encounter a large number of timeouts, try increasing y in the equation above to a large value with --timeoutConst (timeoutConstant in Maven).
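If you see many MEMORY_ERROR or TIMED_OUT warnings, you can give the forked JVMs more heap and relax the timeout in the plugin configuration. A rough sketch for the pom.xml (jvmArgs, timeoutConstant and timeoutFactor are documented pitest-maven parameters; the concrete values and the plugin version are only examples to tune):
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>1.1.10</version> <!-- example version; use the one you already build with -->
    <configuration>
        <!-- extra heap for the forked JVMs that run the mutated tests -->
        <jvmArgs>
            <value>-Xmx1024m</value>
        </jvmArgs>
        <!-- the constant y in "normal time * x + y", in milliseconds -->
        <timeoutConstant>10000</timeoutConstant>
        <!-- the factor x in the same formula -->
        <timeoutFactor>2</timeoutFactor>
    </configuration>
</plugin>
Then rerun mvn org.pitest:pitest-maven:mutationCoverage and compare the number of warnings.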
Related
I want to use callgrind to profile my program, but it is slowed down too much. What I want to do is generate a call graph using kcachegrind where every node shows what percentage of its time the program spent in which function. Can you tell me which features I can safely disable for better performance so that this info is still generated?
Thanks a lot!
Quick Overview
Callgrind is essentially a cache profiler (both instruction and data) that works at function-level granularity in order to reproduce the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.
However, this fine-grained simulation of cache events comes at a heavy cost in program runtime. You should know that even with all profiling turned off and no useful data being collected, Callgrind will still impose a minimum slowdown of about 2-4x. When actively collecting data, it is on average 10-20x slower.
Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, speeding up large, uninteresting chunks of your program to only a 2-4x slowdown sounds reasonable, read on!
Available Hooks
Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:
Instrumentation state - When disabled, no program actions are observed and thus no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum I mentioned above (see Nulgrind).
But be warned, this should be used carefully! While it offers attractive benefits, this can have non-trivial effects on accuracy. From the documentation:
However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrinds internal cache of instrumented code blocks, resulting in latency penalty at switching time.
Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to streamline collected data to only the interesting parts of your call stack.
However, intuitively, this does not offer any noticeable speedup in execution time. And of course, instrumentation needs to be switched on for collection to be enabled.
Commands
valgrind --tool=callgrind
--instr-atstart=<yes|no> ;; default = yes
--collect-atstart=<yes|no> ;; default = yes
--toggle-collect=<function> ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>
Instrumentation - Turning this off at the start means you have to turn it back on again at the appropriate time. Two alternative ways to do this:
During program execution, use the following command from the shell at the appropriate time.
callgrind_control -i <on|off>
This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.
Insert the following macros into your program code and recompile your binary.
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_STOP_INSTRUMENTATION;
Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. Two alternative ways to do this:
Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.
Tip: Wildcards are supported in the function name! (See the example command after this list.)
Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.
CALLGRIND_TOGGLE_COLLECT;
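For example, to collect data only inside a particular parent function and everything it calls (solve_* is a hypothetical function-name pattern):
valgrind --tool=callgrind --collect-atstart=no --toggle-collect="solve_*" <PROGRAM> <PROGRAM_OPTIONS>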
Summary
To combine all the ideas above, a good approach would be:
#include <valgrind/callgrind.h>   // the header ships with Valgrind, typically installed under valgrind/
// Uninteresting program chunk
CALLGRIND_START_INSTRUMENTATION;
// A few extra lines to allow cache warm-up
CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;
// Rest of the program
Recompile, and launch Callgrind with:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>
Note that there will be two Callgrind output files generated by this method: the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second log will report 0 events.
Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.
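To inspect the results, open the generated callgrind.out.* files in KCachegrind, or get a quick text summary with the callgrind_annotate tool that ships with Valgrind, for example:
callgrind_annotate callgrind.out.<pid>.1    ;; text summary of the first dump
kcachegrind callgrind.out.<pid>.1           ;; interactive call-graph view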
In our app running on JDK 8 we use VisualVM to track the usage of loaded classes and the usage of the metaspace.
At some point while our app is running, we see that the number of loaded classes doesn't increase any more, but the metaspace still grows in size. So what else, apart from classes, is stored in the metaspace that could cause that?
While your program is running, some parts of your code may be determined as "hot" by HotSpot's JIT compiler. This will cause those parts to be transformed/compiled to native code, and also some other code may be inlined into it. This native code representation has to go somewhere, and it goes into the same place as other class metadata - the Metaspace.
This explains the continuous growth you're seeing: hot parts are determined over time using a simple metric of how many times a piece of code has been executed. Over time, more and more pieces of code will be JIT-compiled as they hit the threshold set by -XX:CompileThreshold (defaults to 10000).
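If you want to check whether this correlates with what you see, one rough way (standard JDK 8 tools; yourApp.jar and <pid> are placeholders) is to watch JIT activity and metaspace usage side by side:
java -XX:+PrintCompilation -jar yourApp.jar    # logs every method as the JIT compiles it
jstat -gc <pid> 1000                           # the MC/MU columns show metaspace capacity/used, sampled every second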
I am not sure, but here (http://java.dzone.com/articles/java-8-permgen-metaspace) I found this:
Garbage collection of the dead classes and classloaders is triggered once the class metadata usage reaches the “MaxMetaspaceSize”.
Maybe this is the cause of the increasing metaspace size.
I'm doing some work on profiling the behavior of programs. One thing I would like to do is get the amount of time that a process has run on the CPU. I am accomplishing this by reading the sum_exec_runtime field in the Linux kernel's sched_entity data structure.
After testing this with some fairly simple programs which simply execute a loop and then exit, I am running into a peculiar issue: the program does not finish with the same runtime each time it is executed. Since sum_exec_runtime is a value represented in nanoseconds, I would expect it to differ by a few microseconds. However, I am seeing variations of several milliseconds.
My initial reaction was that this could be due to I/O waiting times, however it is my understanding that the process should give up the CPU while waiting for I/O. Furthermore, my test programs are simply executing loops, so there should be very little to no I/O.
I am seeking any advice on the following:
Is sum_exec_runtime not the actual time that a process has had control of the CPU?
Does the process not actually give up the CPU while waiting for I/O?
Are there other factors that could affect the actual runtime of a process (besides I/O)?
Keep in mind, I am only trying to find the actual time that the process spent executing on the CPU. I do not care about the total execution time including sleeping or waiting to run.
Edit: I also want to make clear that there are no branches in my test program aside from the loop, which simply loops for a constant number of iterations.
Thanks.
Your question is really broad, but you can incur context switches for various reasons. Most system calls involve at least one context switch. Page faults cause context switches. Exceeding your time slice causes a context switch.
sum_exec_runtime is equal to utime + stime from /proc/$PID/stat, but sum_exec_runtime is measured in nanoseconds. It sounds like you only care about utime which is the time your process has been scheduled in user mode. See proc(5) for more details.
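If you want to cross-check those numbers from inside the process itself, the POSIX per-process CPU clock and getrusage() both report only time actually spent on the CPU (not sleeping or waiting to run). A minimal sketch, where the loop is just a stand-in for your test program:
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void)
{
    /* Stand-in workload: a plain loop with a constant number of iterations. */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;

    /* Total CPU time consumed by this process (user + system). */
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    printf("cpu time: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);

    /* The same information split into user and system time. */
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("utime: %ld.%06ld s  stime: %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}
Running this repeatedly should show the same kind of millisecond-level jitter you observe in sum_exec_runtime, since it is subject to the same context switches and cache effects.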
You can look at nr_switches, both voluntary and involuntary, which are also part of sched_entity. That will probably account for most of the variation, but I would not expect successive runs to be identical. The exact time you get for each run will be affected by all of the other processes running on the system.
You'll also be affected by the amount of file system cache used on your system and how many file system cache hits you get in successive runs if you are doing any IO at all.
To give a very concrete and obvious example of how other processes can affect the run time of the current process, think about what happens if you exceed your physical RAM constraints. If your program asks for more RAM, then the kernel is going to spend more time swapping. That time spent swapping will be accounted in stime, but will vary depending on how much RAM you need and how much RAM is available. There are lots of other ways that other processes can affect your process's run time; this is just one example.
To answer your 3 points:
sum_exec_runtime is the actual time the scheduler ran the process, including system time.
If you count switching to the kernel as the process giving up the CPU, then yes, but that does not necessarily mean a different user process gets the CPU; yours may simply get it back once the kernel is done.
I think I've already answered this above: there are lots of factors.
I have written a simple SUDOKU solver. To roughly test the performance I'm using simple System.currentTimeMillis calls.
I have prepared a set of initial sudoku configurations in a text file. The program reads the file and solves each sudoku configuration. When running the tests I noticed that the first 3-4 solve runs are much slower than the rest, and by slower I mean by an order of magnitude.
Here is a sample pseudo-code snippet:
main() {
    while (file has lines) {
        configuration = readLine();
        Solver s = new Solver(configuration);
        long now1 = System.currentTimeMillis();
        s.solve();
        long now2 = System.currentTimeMillis();
        System.out.println(now2 - now1);
    }
}
I measure only the solve() method, so IO is not a problem; I even hardcoded some data into the program, and still the first few runs are slower. The difficulty of the puzzle is not an issue either; I have tried different permutations and difficulties of configurations and it is always the same: the first few are slower.
My question is - why is that and is there a way to prevent it?
This is supposed to happen. The JIT compiler optimizes code that gets called more often as your program runs for longer.
This only reflects the general fact that the technique you're using to test performance simply isn't reliable in Java.
In practice, methods are not JIT-compiled the first time they are called by the JVM. For each method, the JVM maintains a call count, which is incremented every time the method is called. The JVM interprets a method until its call count exceeds a JIT compilation threshold. Therefore, often-used methods are compiled soon after the JVM has started, and less-used methods are compiled much later, or not at all. This JIT compilation threshold helps the JVM start quickly.
So the busiest methods of a Java program are always optimized most aggressively, which increases their execution speed each time they are called.
Here is the source for the above information.
In performance testing engagements we'd always run the system being tested for a while to let it reach a steady state. Only then would we start the performance metrics. You might try the same: run the solve() method a number of times before capturing your metrics.
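A rough sketch of that approach, assuming the Solver class from the question and a hypothetical configurations list holding the lines of the input file; System.nanoTime is used for the actual measurement:
// Warm-up phase: run solve() enough times for the JIT to compile the hot paths.
String warmup = configurations.get(0);
for (int i = 0; i < 10000; i++) {
    new Solver(warmup).solve();
}

// Measurement phase: timings should now be much more stable.
for (String configuration : configurations) {
    Solver s = new Solver(configuration);
    long start = System.nanoTime();
    s.solve();
    long elapsed = System.nanoTime() - start;
    System.out.println(elapsed / 1000000 + " ms");
}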
I have some CUDA code running through some FFTs and other math operations, which works on blocks of 2^n as requested by the user. The code works well when first run, but after running long enough it starts to fail. Eventually it gets to the point where, if I run any block size larger than 2^11, I get no data back (all zeros). I've done some testing by modifying the kernel code, and from what I can tell the kernel is not executing. I'm trying to figure out why my code stops producing data after multiple iterations on large block sizes.
The issue looks at first glance to be a memory leak. I know I have to run multiple iterations of the processing to cause an error. At first only large block sizes stop working, but as I run more iterations, smaller block sizes start to fail as well. The reason I'm not certain the issue is memory is that my code will work for a block size smaller than 2^11 regardless of how many iterations I run. If this were a simple memory leak, I would have expected the symptoms to get progressively worse until I couldn't access any memory on the card.
I've also noticed that larger block sizes (roughly equivalent to the amount of memory each thread uses) tend to cause the program to fail sooner. Increasing the number of blocks processed (i.e. the number of CUDA threads) doesn't seem to have an effect on when the code starts to fail.
As far as I can tell no error code is being returned, and the kernel doesn't appear to execute at all.
Can anyone suggest what may be causing this issue? I would settle for any insight into how to debug code running on the GPU or how to monitor GPU memory availability.
If you need more computation done, bump up your grid size and not your thread block size. To quote the CUDA programming guide 3.0 on pg. 8, "On current GPUs, a thread block may contain up to 512 threads."
This means that blockDim.x * blockDim.y * blockDim.z <= 512 at all times. If you maintain that invariant, do things work?
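Since you say no error code is being returned, it is worth making sure every CUDA call and kernel launch is actually checked; an unchecked launch failure looks exactly like a kernel that silently does nothing. A minimal sketch of that pattern (myKernel, the launch configuration and the buffer size are placeholders for your FFT/math kernels):
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real FFT/math kernels.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;

    // Check allocations: running out of device memory shows up here first.
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // You can also watch how much device memory is left between iterations.
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("free device memory: %zu of %zu bytes\n", freeMem, totalMem);

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Catches launch-configuration errors (too many threads per block,
    // too much shared memory or too many registers per thread, ...).
    err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));

    // Catches errors that occur while the kernel actually runs.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
If the launch check reports something like "too many resources requested for launch" for the larger block sizes, that points at a per-block resource limit rather than a leak; cuda-memcheck (shipped with the toolkit) is also useful for catching out-of-bounds accesses that corrupt later runs.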