I have written a simple SUDOKU solver. To roughly test the performance I'm using simple System.currentTimeMillis calls.
I have prepared a set of initial sudoku configuration in text file. The program reads the file and solves each sudoku configuration. When running the tests I have noticed that the first 3-4 solve runs are really slower than the rest and by slower I mean by order of magnitude.
There is sample pseudo-code snippet:
main(){
while(file has lines){
configuration = readLine();
Solver s = new Solver(configuration);
now1 = System.currentTimeMillis();
s.solve();
now2 = System.currentTimeMillis();
System.out.print(now2 - now1);
}
}
I measure only solve() method, so IO is not a problem, I even hardcoded some data into program - still first few slower. The difficulty of puzzle is not an issue as well I have tried different permutations and difficulties of configuration and always the same - first few are slower.
My question is - why is that and is there a way to prevent it?
This is supposed to happen. The JIT compiler optimizes code that gets called more often as your program runs for longer.
This only reflects the general fact that the technique you're using to test performance simply isn't reliable in Java.
In Practice, Methods are not compiled by JIT the first time they are called by JVM. For each method, the JVM maintains a call count, which is incremented every time the method is called. The JVM interprets a method until its call count exceeds a JIT compilation threshold. Therefore, often-used methods are compiled soon after the JVM has started , and less-used methods are compiled much later, or not at all. This JIT compilation Threshold helps the JVM start quickly.
So, the busiest methods of java program are always optimized most
aggresively which increases its execution speed each time it is called by program.
Here is the Source for above information.
In performance testing engagements we'd always run the system being tested for a while to let it reach a steady state. Only then would we start the performance metrics. You might try the same: run the solve() method a number of times before capturing your metrics.
Related
I want to use callgrind to profile my program, but it is slowed down too much. What I want to do is generate a callgraph using kcachegrind where every node shows how much percentage the program spent in which function. Can you tell me which features I can safely disable for better performance so this info is still generated?
Thanks a lot!
Quick Overview
Callgrind is essentially a cache profiler (both instruction and data) that works at function-level granularity in order to reproduce the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.
However, this fine-grained simulation of cache events comes at a heavy cost of program runtime. You should know that even with all profiling turned off and no useful data being collected, Callgrind will still have a minimum of about 2-4x hit in runtime. When actively collecting data, it would be an average of 10-20x slower.
Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, speeding up large, uninteresting chunks of your program to only a 2-4x slowdown sounds reasonable, read on!
Available Hooks
Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:
Intrumentation state - When disabled, no program actions are observed and thus, no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum I mentioned above (see Nulgrind).
But be warned, this should be used carefully! While it offers attractive benefits, this can have non-trivial effects on accuracy. From the documentation:
However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrinds internal cache of instrumented code blocks, resulting in latency penalty at switching time.
Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to streamline collected data to only the interesting parts of your call stack.
However, intuitively, this does not offer any noticeable speedup in execution time. And of course, instrumentation needs to be switched on for collection to be enabled.
Commands
valgrind --tool=callgrind
--instr-atstart=<yes|no> ;; default = yes
--collect-atstart=<yes|no> ;; default = yes
--toggle-collect=<function> ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>
Instrumentation - Turning this off in the beginning indicates you have to turn it back on again at the appropriate time. 2 alternate ways to do this:
During program execution, use the following command from the shell at the appropriate time.
callgrind_control -i <on|off>
This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.
Insert the following macros into your program code and recompile your binary.
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_STOP_INSTRUMENTATION;
Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. 2 alternate ways to do this:
Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.
Tip: Wildcards are supported in the function name!
Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.
CALLGRIND_TOGGLE_COLLECT;
Summary
To combine all the ideas above, a good approach would be:
#include <callgrind.h>
// Uninteresting program chunk
CALLGRIND_START_INSTRUMENTATION;
// A few extra lines to allow cache warm-up
CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;
// Rest of the program
Recompile, and launch Callgrind with:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>
Note that there will be 2 Callgrind output files generated by this method - the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second log will report 0 events.
Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.
Recently I started using PITest for Mutation Testing. Post building my project using maven when I run the command mvn org.pitest:pitest-maven:mutationCoverage I get this error bunch of times:
-stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be ustderr : sed. Which one is undefined.
Sometimes the error is followed by
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
or PIT >> WARNING : Slave exited abnormally due to TIMED_OUT
I use OsX version 10.10.4 and Java 8 (jdk1.8.0_74).
Any fix/ work-around for this?
Don't worry about this;
-stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be ustderr : sed. Which one is undefined.
This is just for information that there are two implementations of JavaLauncherHelper and the message tells you that one of the two will use std-err output stream but it is undetermined which one of the two. It is a known isse, see also this question
The other two are a result of what PIT is doing: it's modifying the byte code and it may happen that this not just affects the output of an operation (detected by a test) but actually affects the runtime behavior. For example if boundaries of a loop get changed that way, that the loop runs endlessly. Pit is capable of detecting this and prints out an error. Mutations detected by either a memory error or a timeout error can be considered as "killed". But you should check each of those individually as they could be false positives, too.
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
means the modified code produces more or larger objects so the forked jvm runs out of memory. Imagine a loop like this
while(a < b){
list.add(new Object());
a++;
}
And the a++ gets altered to a--. The loop may eventually end, but it's more likely you run out of memory before that.
From the documentation
A memory error might occur as a result of a mutation that increases the amount of memory used by the system, or may be the result of the additional memory overhead required to repeatedly run your tests in the presence of mutations. If you see a large number of memory errors consider configuring more heap and permgen space for the tests.
The timeout issue is similar to this, the reason coud be either, that you run an infinite loop or that system thinks you run an infinite loop, i.e. when the system is too slow to compute the altered code. If you experience a lot of timeouts you should consider increasing the timeout value. But be carefull, as this may impact the overall execution time.
From the FAQ
Timeouts when running mutation tests are caused by one of two things
1 A mutation that causes an infinite loop
2 PIT thinking an infinite loop has occured but being wrong
In order to detect infinite loops PIT measures the normal execution time of each test without any mutations present. When the test is run in the presence of a mutation PIT checks that the test doesn’t run for any longer than
normal time * x + y
Unfortunately the real world is more complex than this.
Test times can vary due to the order in which the tests are run. The first test in a class may have a execution time much higher than the others as the JVM will need to load the classes required for that test. This can be particularly pronounced in code that uses XML binding frameworks such as jaxb where classloading may take several seconds.
When PIT runs the tests against a mutation the order of the tests will be different. Tests that previously took miliseconds may now take seconds as they now carry the overhead of classloading. PIT may therefore incorrectly flag the mutation as causing an infinite loop.
An fix for this issue may be developed in a future version of PIT. In the meantime if you encounter a large number of timeouts, try increasing y in the equations above to a large value with –timeoutConst (timeoutConstant in maven).
I'm doing some work on profiling the behavior of programs. One thing I would like to do is get the amount of time that a process has run on the CPU. I am accomplishing this by reading the sum_exec_runtime field in the Linux kernel's sched_entity data structure.
After testing this with some fairly simple programs which simply execute a loop and then exit, I am running into a peculiar issue, being that the program does not finish with the same runtime each time it is executed. Seeing as sum_exec_runtime is a value represented in nanoseconds, I would expect the value to differ within a few microseconds. However, I am seeing variations of several milliseconds.
My initial reaction was that this could be due to I/O waiting times, however it is my understanding that the process should give up the CPU while waiting for I/O. Furthermore, my test programs are simply executing loops, so there should be very little to no I/O.
I am seeking any advice on the following:
Is sum_exec_runtime not the actual time that a process has had control of the CPU?
Does the process not actually give up the CPU while waiting for I/O?
Are there other factors that could affect the actual runtime of a process (besides I/O)?
Keep in mind, I am only trying to find the actual time that the process spent executing on the CPU. I do not care about the total execution time including sleeping or waiting to run.
Edit: I also want to make clear that there are no branches in my test program aside from the loop, which simply loops for a constant number of iterations.
Thanks.
Your question is really broad, but you can incur context switches for various reasons. Calling most system calls involves at least one context switch. Page faults cause contexts switches. Exceeding your time slice causes a context switch.
sum_exec_runtime is equal to utime + stime from /proc/$PID/stat, but sum_exec_runtime is measured in nanoseconds. It sounds like you only care about utime which is the time your process has been scheduled in user mode. See proc(5) for more details.
You can look at nr_switches both voluntary and involuntary which are also part of sched_entity. That will probably account for most variation, but I would not expect successive runs to be identical. The exact time that you get for each run will be affected by all of the other processes running on the system.
You'll also be affected by the amount of file system cache used on your system and how many file system cache hits you get in successive runs if you are doing any IO at all.
To give a very concrete and obvious example of how other processes can affect the run time of the current process, think about if you are exceeding your physical RAM constraints. If your program asks for more RAM, then the kernel is going to spend more time swapping. That time swapping will be accounted in stime but will vary depending on how much RAM you need and how much RAM is available. There are lot's of other ways that other processes can affect your process's run time. This is just one example.
To answer your 3 points:
sum_exec_runtime is the actual time the scheduler ran the process including system time
If you count switching to the kernel as the process giving up the CPU, then yes, but it does not necessarily mean a different user process may get the CPU back once the kernel is done.
I think I've already answered this question that there are lot's of factors.
I am profiling a userland application on netbsd with gprof and seeing clock_gettime using upwards of 30% cycles. Gprof does not show where it is getting called from (it shows some function which clearly does not call clock_getttime).
The application uses third party code including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from that but could not determine much.
I don't understand why it would take that much of time. Any inputs will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles
Is there a way that one can optimize clock_gettime () or can we use any other call?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers
It's all relative to whatever else your program is doing, and keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.
i profile a big jboss server with a lot of classes in it. When i profile the CPU the result is always something like java.util.TimerThread.run() = 62% and java.util.concurrent.ThreadPoolExecutor$Worker.run() = 34,8%.
Under these two methods thousand other methods have 0%.
I think thats a bad bug, because most of these methods run in these Threads. But how can i see which one...
The ThreadDump - Function isnt usefull for this too.
If you don't know which part of the code is slow, it is better to start with CPU sampling. Once you know better (based on the sampling results) what is wrong, you can profile just part of your jboss server. See Profiling With VisualVM, Part 1 and Profiling With VisualVM, Part 2 to get more information about profiling and how to set profiling roots and instrumentation filter.