How to control the exact number of tests to run with Caliper

I am trying to understand the proper way to control the number of runs: is it the trial or the rep? It is confusing: I run the benchmark with --trial 1 and receive the output:
0% Scenario{vm=java, trial=0, benchmark=SendPublisher} 1002183670.00 ns; σ=315184.24 ns @ 3 trials
It looks like 3 trials were run. What are those trials? What are reps? I can control the rep value with the options --debug & --debug-reps, but what is the value when running without debug? I need to know exactly how many times my tested method was called.

Between Caliper 0.5 and 1.0 a bit of the terminology has changed, but this should explain it for both. Keep in mind that things were a little murky in 0.5, so most of the changes made for 1.0 were to make things clearer and more precise.
Caliper 0.5
A single invocation of Caliper is a run. Each run has some number of trials, which is just another iteration of all of the work performed in a run. Within each trial, Caliper executes some number of scenarios. A scenario is the combination of VM, benchmark, etc. The runtime of a scenario is measured by timing the execution of some number of reps, which is the number passed to your benchmark method at runtime. Multiple reps are, of course, necessary because it would be impossible to get precise measurements for a single invocation in a microbenchmark.
Caliper 1.0
Caliper 1.0 follows a pretty similar model. A single invocation of Caliper is still a run. Each run consists of some number of trials, but a trial is more precisely defined as an invocation of a scenario measured with an instrument.
A scenario is roughly defined as what you're measuring (host, VM, benchmark, parameters) and the instrument is what code performs the measurement and how it was configured. The idea being that if a perfectly repeatable benchmark were a function of the form f(x)=y, Caliper would be defined as instrument(scenario)=measurements.
Within the execution of the runtime instrument (it's similar for others), there is still the same notion of reps, which is the number of iterations passed to the benchmark code. You can't control the rep value directly since each instrument will perform its own calculation to determine what it should be.
At runtime, Caliper plans its execution by calculating some number of experiments, which is the combination of instrument, benchmark, VM and parameters. Each experiment is run --trials number of times and reported as an individual trial with its own ID.
How to use the reps parameter
Traditionally, the best way to use the reps parameter is to include a loop in your benchmark code that looks more or less like:
for (int i = 0; i < reps; i++) {…}
This is the most direct way to ensure that the number of reps scales linearly with the reported runtime. That is a necessary property because Caliper is attempting to infer the cost of a single, uniform operation based on the aggregate cost of many. If runtime isn't linear with the number of reps, the results will be invalid. This also implies that reps should not be passed directly to the benchmarked code.
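For concreteness, here is a minimal sketch of that pattern, assuming a Caliper 1.0 style benchmark class using the @Benchmark annotation; the class name, method name and doOneOperation() are illustrative stand-ins, not taken from the question:

    import com.google.caliper.Benchmark;

    // Minimal sketch of the reps loop described above (Caliper 1.0 style).
    // doOneOperation() stands in for whatever the benchmark actually measures.
    public class SendPublisherBenchmark {
        @Benchmark
        long sendMessages(int reps) {
            long dummy = 0;
            for (int i = 0; i < reps; i++) {
                dummy += doOneOperation();  // total runtime must scale linearly with reps
            }
            return dummy;                   // return a result so the JIT cannot discard the work
        }

        private int doOneOperation() {
            return 1;                       // placeholder for the code under test
        }
    }

Caliper divides the measured time by reps to report a per-operation cost, which is why the linear-scaling property described above matters.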


Raku parallel/functional methods

I am pretty new to Raku and I have a question about functional methods, in particular reduce.
I originally had the method:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    for @_ {
        $foo += ($_ - $mittel)**2;
    }
    $foo = sqrt($foo/(@_.elems));
}
and it worked fine. Then I started to use reduce:
sub standardab {
    my $mittel = mittel(@_);
    my $foo = 0;
    $foo = @_.reduce({$^a + ($^b - $mittel)**2});
    $foo = sqrt($foo/(@_.elems));
}
My execution time doubled (I am applying this to roughly 1000 elements) and the solution differed by 0.004 (I guess a rounding error).
If I am using
.race.reduce(...)
my execution time is 4 times higher than with the original sequential code.
Can someone tell me the reason for this?
I thought about parallelism initialization time, but, as I said, I am applying this to 1000 elements, and if I change other for loops in my code to reduce it gets even slower!
Thanks for your help
Summary
In general, reduce and for do different things, and they are doing different things in your code. For example, compared with your for code, your reduce code involves twice as many arguments being passed and is doing one less iteration. I think that's likely at the root of the 0.004 difference.
Even if your for and reduce code did the same thing, an optimized version of such reduce code would never be faster than an equally optimized version of equivalent for code.
I thought that race didn't automatically parallelize reduce due to reduce's nature. (Though I see per your and @user0721090601's comment I'm wrong.) But it will incur overhead -- currently a lot.
You could use race to parallelize your for loop instead, if it's slightly rewritten. That might speed it up.
On the difference between your for and reduce code
Here's the difference I meant:
say do for <a b c d> { $^a } # (a b c d) (4 iterations)
say do reduce <a b c d>: { $^a, $^b } # (((a b) c) d) (3 iterations)
For more details of their operation, see their respective doc (for, reduce).
You haven't shared your data, but I will presume that the for and/or reduce computations involve Nums (floats). Addition of floats isn't associative, so you may well get (typically small) discrepancies if the additions end up happening in a different order or grouping.
I presume that explains the 0.004 difference.
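The same shape difference exists outside Raku. As a purely illustrative analogue (Java, not the OP's code), here is a loop versus a fold over the same data; the fold's combiner takes two arguments per call, runs one fewer time, and folds the first element in unchanged as the seed, so the two results differ:

    import java.util.List;

    // Illustrative analogue of the for-vs-reduce difference discussed above.
    final class LoopVsFold {
        public static void main(String[] args) {
            List<Double> xs = List.of(1.0, 2.0, 3.0, 4.0);
            double mean = xs.stream().mapToDouble(Double::doubleValue).average().orElse(0);

            // Loop: every element contributes (x - mean)^2; 4 iterations here.
            double loopSum = 0;
            for (double x : xs) {
                loopSum += (x - mean) * (x - mean);
            }

            // Fold in the same shape as the OP's reduce: the first element is the
            // seed, so (seed - mean)^2 is never applied to it; only 3 combiner calls.
            double foldSum = xs.stream()
                    .reduce((a, b) -> a + (b - mean) * (b - mean))
                    .orElse(0.0);

            System.out.println(loopSum + " vs " + foldSum);  // the two results differ
        }
    }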
On your sequential reduce being 2X slower than your for
my execution time doubled (I am applying this to roughly 1000 elements)
First, your reduce code is different, as explained above. There are general abstract differences (eg taking two arguments per call instead of your for block's one) and perhaps your specific data leads to fundamental numeric computation differences (perhaps your for loop computation is primarily integer or float math while your reduce is primarily rational?). That might explain the execution time difference, or some of it.
Another part of it may be the difference between, on the one hand, a reduce, which will by default compile into calls of a closure, with call overhead, and two arguments per call, and temporary memory storing intermediate results, and, on the other, a for which will by default compile into direct iteration, with the {...} being just inlined code rather than a call of a closure. (That said, it's possible a reduce will sometimes compile to inlined code; and it may even already be that way for your code.)
More generally, Rakudo optimization effort is still in its relatively early days. Most of it has been generic, speeding up all code. Where effort has been applied to particular constructs, the most widely used constructs have gotten the attention so far, and for is widely used and reduce less so. So some or all the difference may just be that reduce is poorly optimized.
On reduce with race
my execution time [for .race.reduce(...)] is 4 times higher than with the original sequential code
I didn't think reduce would be automatically parallelizable with race. Per its doc, reduce works by "iteratively applying a function which knows how to combine two values", and one argument in each iteration is the result of the previous iteration. So it seemed to me it must be done sequentially.
(I see in the comments that I'm misunderstanding what could be done by a compiler with a reduction. Perhaps this is if it's a commutative operation?)
In summary, your code is incurring raceing's overhead without gaining any benefit.
On race in general
Let's say you're using some operation that is parallelizable with race.
First, as you noted, race incurs overhead. There'll be an initialization and teardown cost, at least some of which is paid repeatedly for each evaluation of an overall statement/expression that's being raced.
Second, at least for now, race means use of threads running on CPU cores. For some payloads that can yield a useful benefit despite any initialization and teardown costs. But it will, at best, be a speed up equal to the number of cores.
(One day it should be possible for compiler implementors to spot that a raced for loop is simple enough to be run on a GPU rather than a CPU, and go ahead and send it to a GPU to achieve a spectacular speed up.)
Third, if you literally write .race.foo... you'll get default settings for some tunable aspects of the racing. The defaults are almost certainly not optimal and may be way off.
The currently tunable settings are :batch and :degree. See their doc for more details.
More generally, whether parallelization speeds up code depends on the details of a specific use case such as the data and hardware in use.
On using race with for
If you rewrite your code a bit you can race your for:
$foo = sum do race for @_ { ($_ - $mittel)**2 }
To apply tuning you must repeat the race as a method, for example:
$foo = sum do race for @_.race(:degree(8)) { ($_ - $mittel)**2 }

Determining a program's execution time by its length in bits?

This is a question that popped into my mind while reading about the halting problem, the Collatz conjecture and Kolmogorov complexity. I have tried to search for something similar but I was unable to find a particular topic, maybe because it is not of great value or it could just be a trivial question.
For the sake of simplicity I will give three examples of programs/functions.
function one(s):
    return s

function two(s):
    while (True):
        print(s)

function three(s):
    for i from 0 to 10^10:
        print(s)
So my question is whether there is a way to formalize the length of a program (like the bits used to describe it) and also the internal memory used by the program, in order to determine the minimum/maximum number of steps needed to decide whether the program will terminate or run forever.
For example, in the first function the program doesn't alter its internal memory and halts after some time steps.
In the second example, the program runs forever but also doesn't alter its internal memory. For example, if we considered all the programs of the same length as program two that do not alter their state, couldn't we determine an upper bound of steps which, if surpassed, would let us conclude that the program will never terminate? (If not, why not?)
On the last example, the program alters its state (variable i). So, at each step the upper bound may change.
[In short]
Kolmogorov complexity suggests a way of finding the (descriptive) complexity of an object such as a piece of text. I would like to know, given a formal way of describing the memory-space used by a program (computed in runtime), if we could compute a maximum number of steps, which if surpassed would allow us to know whether this program will terminate or run forever.
Finally, I would appreciate suggestions for any sources that I might find useful and that would help me figure out what exactly I am looking for.
Thank you. (Sorry for my English, it is not my native language. I hope I was clear.)
If a deterministic Turing machine enters precisely the same configuration twice (which we can detect by keeping a trace of the configurations seen so far), then we immediately know the TM will loop forever.
If it is known in advance that a deterministic Turing machine cannot possibly use more than some fixed constant amount of its input tape, then the TM must either explicitly halt or eventually re-enter some configuration it has already visited. Suppose the TM can use at most k tape cells, the tape alphabet is T and the set of states is Q. A configuration is determined by the tape contents, the current state and the head position, so there are at most (|T|+1)^k * |Q| * k distinct configurations (the number of strings of length k over T plus the blank symbol, times the number of states, times the number of possible head positions), and by the pigeonhole principle a TM that runs for more than that many steps must have entered some configuration it has already been in before.
one: because we are given that this function does not use internal memory, its only changing state is its position in the code, so after a bounded number of steps it must either halt or repeat a position; this one simply halts.
two: by the same argument it must either halt or repeat a position within a bounded number of steps; it repeats the top of the while loop, so it loops forever.
three: because we are given that this function only uses a fixed amount of internal memory (say, 34 bits for the loop counter), we can tell in fewer than 2^34 iterations of the loop whether it will halt or not for any given input s, guaranteed.
Now, knowing how much tape a TM is going to use, or how much memory a program is going to use, is not a problem a TM can solve. But if you have an oracle (like a person who was able to do a proof) that tells you a correct fixed upper bound on memory, then the halting problem is solvable.
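As a sketch of the argument above (in Java, with Config, step() and isHalting() as hypothetical stand-ins for a concrete bounded-memory machine): record every configuration seen, and report a loop at the first repeat. The pigeonhole bound from the previous paragraphs is exactly what guarantees this procedure terminates.

    import java.util.HashSet;
    import java.util.Set;

    // A configuration must capture the complete machine state (control state, head
    // position, bounded tape contents) and implement equals()/hashCode() accordingly.
    interface Config {
        boolean isHalting();
        Config step();   // deterministic successor configuration
    }

    final class BoundedHaltChecker {
        enum Verdict { HALTS, LOOPS_FOREVER }

        static Verdict check(Config start) {
            Set<Config> seen = new HashSet<>();      // trace of configurations seen so far
            Config current = start;
            while (true) {
                if (current.isHalting()) {
                    return Verdict.HALTS;            // explicit halt reached
                }
                if (!seen.add(current)) {
                    return Verdict.LOOPS_FOREVER;    // same configuration twice: it will loop forever
                }
                current = current.step();
            }
        }
    }

The while loop runs at most as many iterations as there are distinct configurations, which is finite precisely because the memory bound is known in advance.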

Memory efficiency in If statements

I'm thinking more about how much system memory my programs will use nowadays. I'm currently doing A level Computing at college and I know that in most programs the difference will be negligible but I'm wondering if the following actually makes any difference, in any language.
Say I wanted to output "True" or "False" depending on whether a condition is true. Personally, I prefer to do something like this:
Dim result As String
If condition Then
    result = "True"
Else
    result = "False"
End If
Console.WriteLine(result)
However, I'm wondering if the following would consume less memory, etc.:
If condition Then
    Console.WriteLine("True")
Else
    Console.WriteLine("False")
End If
Obviously this is a very much simplified example and in most of my cases there is much more to be outputted, and I realise that in most commercial programs these kind of statements are rare, but hopefully you get the principle.
I'm focusing on VB.NET here because that is the language used for the course, but really I would be interested to know how this differs in different programming languages.
The main issue making ifs fast or slow is predictability.
Modern CPUs (anything after 2000) use a mechanism called branch prediction.
Read up on branch prediction first, then read on below...
Which is faster?
The if statement constitutes a branch, because the CPU needs to decide whether to follow or skip the if part.
If it guesses the branch correctly the jump will execute in 0 or 1 cycle (1 nanosecond on a 1 GHz computer).
If it does not guess the branch correctly the jump will take 50 cycles (give or take), roughly 1/20th of a microsecond on that same machine.
Therefore to even feel these differences as a human, you'd need to execute the if statement many millions of times.
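To make that concrete, here is a rough, self-contained sketch (Java rather than VB.NET, and a deliberately naive timing loop rather than a proper harness) of the classic sorted-versus-unsorted experiment: the same if statement runs over the same data, but sorting makes the branch predictable and the second pass noticeably faster on typical hardware.

    import java.util.Arrays;
    import java.util.Random;

    // Naive demonstration of branch predictability; real measurements should use a
    // benchmarking harness, and absolute numbers vary by CPU and JIT warm-up.
    final class BranchPredictionDemo {
        static long countBigValues(int[] data) {
            long count = 0;
            for (int value : data) {
                if (value >= 128) {   // the branch whose predictability we vary
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            int[] data = new Random(42).ints(20_000_000, 0, 256).toArray();
            countBigValues(data);                        // warm-up so the JIT compiles the method first

            long t0 = System.nanoTime();
            long unsorted = countBigValues(data);        // branch outcome is effectively random
            long t1 = System.nanoTime();

            Arrays.sort(data);
            long t2 = System.nanoTime();
            long sorted = countBigValues(data);          // branch outcome is monotone, well predicted
            long t3 = System.nanoTime();

            System.out.printf("unsorted: %d in %d ms%n", unsorted, (t1 - t0) / 1_000_000);
            System.out.printf("sorted:   %d in %d ms%n", sorted, (t3 - t2) / 1_000_000);
        }
    }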
The two statements above are likely to execute in exactly the same amount of time, because:
assigning a value to a variable takes negligible time; on average less than a single CPU cycle on a superscalar CPU*.
calling a function with a constant parameter requires the use of an invisible temporary variable; so in all likelihood code A compiles to almost the exact same object code as code B.
*) All current desktop and server CPUs are superscalar.
Which consumes less memory
As stated above, both versions need to put the boolean into a variable.
Version A uses an explicit one, declared by you; version B uses an implicit one declared by the compiler.
However version A is guaranteed to only have one call to the function WriteLine.
Whilst version B may (or may not) have two calls to the function WriteLine.
If the optimizer in the compiler is good, code B will be transformed into code A; if it's not, it will keep the redundant call.
How bad is the waste
The call takes about 10 bytes for the assignment of the string (Unicode 2 bytes per char).
But so does the other version, so that's the same.
That leaves 5 bytes for a call, plus maybe a few extra bytes to set up a stack frame.
So let's say due to your totally horrible coding you have now wasted 10 bytes.
Not much to worry about.
From a maintainability point of view
Computer code is written for humans, not machines.
So from that point of view code A is clearly superior.
Imagine choosing not between 2 options (true or false) but 20.
You only call the function once.
If you decide to change the WriteLine for another function you only have to change it in one place, not two or 20.
How to speed this up?
With 2 values it's pretty much impossible, but if you had 20 values you could use a lookup table.
Obviously that optimization is not worth it unless code gets executed many times.
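A rough sketch of the lookup-table idea (in Java rather than VB.NET; the codes and messages are made up for illustration): instead of a 20-branch If/ElseIf chain mapping a small integer code to a message, index into an array once.

    // Replaces a long chain of conditionals with a single indexed load.
    final class LookupTableDemo {
        private static final String[] MESSAGES = {
            "False", "True", "Maybe", "Unknown"    // ... imagine 20 entries here
        };

        static String describe(int code) {
            if (code < 0 || code >= MESSAGES.length) {
                return "Invalid";                  // one range check instead of 20 comparisons
            }
            return MESSAGES[code];
        }

        public static void main(String[] args) {
            System.out.println(describe(1));       // prints "True"
        }
    }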
If you need to know the precise amount of memory the instructions are going to take, you can use ildasm on your code, and see for yourself. However, the amount of memory consumed by your code is much less relevant today, when the memory is so cheap and abundant, and compilers are smart enough to see common patterns and reduce the amount of code that they generate.
A much greater concern is readability of your code: if a complex chain of conditions always leads to printing a conditionally set result, your first code block expresses this idea in a cleaner way than the second one does. Everything else being equal, you should prefer whatever form of code that you find the most readable, and let the compiler worry about optimization.
P.S. It goes without saying that Console.WriteLine(condition) would produce the same result, but that is of course not the point of your question.

Parallel Processing in MATLAB with more than 12 cores

I created a function to compute the correct number of k's for a dataset using the Gap Statistics algorithm. This algorithm requires at one point computing the dispersion (i.e., the sum of the distances between every point and its centroid) for, let's say, 100 different datasets (called "test data(sets)" or "reference data(sets)"). Since these operations are independent, I want to parallelize them across all the cores. I have the MathWorks Parallel Toolbox but I am not sure how to use it (problem 1; I can use past threads to understand this, I guess). However, my real problem is another one: this toolbox seems to allow the usage of just 12 cores (problem 2). My machine has 64 cores and I need to use all of them. Do you know how to parallelize a process across more than 12 cores?
For your information this is the bit of code that should run in parallel:
% This cycle is repeated n_tests times, where n_tests is equal
% to the number of reference datasets we want to use
for id_test = 2:n_tests+1
    test_data = generate_test_data(data);
    %% Calculate the dispersion(s) for the generated dataset(s)
    dispersions(id_test, 1:1:max_k) = zeros;
    % We calculate the dispersion for the id_test reference dataset
    for id_k = 1:1:max_k
        dispersions(id_test, id_k) = calculate_dispersion(test_data, id_k);
    end
end
Please note that in R2014a the limit on the number of local workers was removed. See the release notes.
The number of local workers available with Parallel Computing Toolbox is license dependent. When introduced, the limit was 4; this changed to 8 in R2009a; and to 12 in R2011b.
If you want to use 16 workers, you will need a 16-node MDCS licence, and you'll also need to set up some sort of scheduler to manage those. There are detailed instructions about how to do this here: http://www.mathworks.de/support/product/DM/installation/ver_current/. Once you've done that, yes, you'll be able to do "matlabpool open 16".
EDIT: As of Matlab version R2014a there is no longer a limit on the number of local workers for the Parallel Computing Toolbox. That is, if you are using an up-to-date version of Matlab you will not encounter the problem described by the OP.
The fact that MATLAB places this restriction on its parallel toolbox often makes it not worth the money and effort of using.
One way of solving this is to use a combination of the MATLAB compiler and virtual machines, using either VMware or VirtualBox.
Compile the code required to run your tests.
Load your compiled code with the MCR (MATLAB Compiler Runtime) on a VM template.
Create multiple copies of the VM template, let each template run the required calculations for some of the datasets.
Gather the data of all your results
This method is time consuming and only worth it if it saves more time than porting the code and the code is already highly optimised.
I had the same problem on a 32-core machine with 6 datasets. I overcame it by creating a shell script which started MATLAB six times, once for each data set. I could do this because the computations weren't dependent on each other. From what I understand, you could use a similar approach by starting around 6 instances, each handling around 16 datasets. It depends on how much RAM you have and how much each instance consumes.

Rough estimate of test cases

I'm curious how many test cases others have for a site similar to mine. It's your basic CRUD with business workflow website. 3 user roles, a couple input pages, a couple search pages, a business rule engine, etc. Maybe 50k lines of .NET code (workflow and persistence altogether). DB with about 10 main tables plus about 100 supporting tables (lookups, logs, etc.). The main UI for entering data is quite big, around 100 data fields, multiple grids, about 5 action/submit type buttons.
I know this is vague and I'm only hoping for order of magnitude figures. I'm also thinking of basic test cases, not code coverage type cases. But like if I told you we had 25 test cases I'm sure you'd say way WAY not enough. So I'm just looking for ballpark figures.
TIA
I would have as many test cases as it takes to ensure a high level of confidence in the system.
The number of tables, rules, lines of code, etc is actually immaterial.
You should have the appropriate unit tests to ensure your domain objects and business rules are firing correctly. You should have tests to ensure your queries execute appropriately (this is a harder one).
You might even want to have test cases for paths through the software. In other words, click here, get this page, click there, edit a field, save the page, go back... This type is the most difficult as the tests are usually recorded and have to be rerecorded when the pages change (ie: a field is added or removed).
Generally speaking it's more about coverage than number of tests. You want your tests to cover as much of the application's functionality as is feasible. Note that I didn't say possible. You can cover an entire application (100%) with test cases, but for every little change, bug fix, etc. you'll have to recode those tests. This is more desirable for a mature app. For newer apps you don't want to hamstring your developers and QA team that way as they'll spend inordinate amounts of time fixing/changing unit tests...
For any system, you could easily spend as much time developing your automated tests as you do the system itself. In some cases, even more.
As for our group, we tend to have lots of unit tests. However, for testing paths through the system we only record those once a particular area has moved into a "maintenance" type of mode. Meaning we expect little change for quite a while in that area and the path test is simply to ensure no one jacked it up.
UPDATE: the comments here led me to the following:
Going a little further: Let's examine 1 small piece of code:
Int32 AddNumbers(Int32 a, Int32 b) {
    return a + b;
}
On the face of it you could get away with a single test:
Int32 result = AddNumbers(1,2);
Assert.Equals(result, 3);
However, that probably isn't enough. What happens if you do this:
Int32 result = AddNumbers(Int32.MaxValue, 1);
Assert.Equals(result, (Int32.MaxValue+1));
Now we have a failure. Here's another one:
Int32 result = AddNumbers(Int32.MinValue, -1);
Assert.Equals(result, (Int32.MinValue-1));
So, we have an extremely simple method that requires at least 3 tests. The initial to see if it can give any result, then 2 for bounds checking. That's 3 tests for essentially 2 lines of code (method definition and the one line computation).
As your code becomes more complex, things get really dicey:
Decimal DivideThis(Decimal a, Decimal b) {
    return Decimal.Divide(a, b);
}
This slight change introduces yet another exception condition beyond bounds: DivideByZero. So now we are up to 4 tests required for 2 lines of code.
Now, let's simplify it a bit:
String AppendData(String data, String toAppend) {
    return String.Format("{0}{1}", data, toAppend);
}
Our test case here is:
String result = AppendData("Hello", "World");
Assert.Equals(result, "HelloWorld");
That's just one test case for the code block, with no others really needed.
What does this tell us: for starters, 2 lines of code might cause us to need between 1 and 4 test cases. You mentioned 50k lines... Using that logic (1 to 4 cases per 2 lines), you will need between 25,000 and 100,000 test cases...
Of course, life is rarely so simple. In those 50k lines of code you have, there are going to be large blocks of code that have very limited inputs. For example a mortgage interest calculator might take 3 parameters, and return 1 value (the APR). The code itself might run 100 lines or so (been awhile, just work with me). The number of test cases for this is going to be determined by edge cases along the lines of making sure you properly handle rounding.
So, let's say it's 5 cases: which brings us to 20 lines of code = 1 case. Calculating that out your 50k lines might result in 2,500 test cases. Obviously much smaller than what we expected above.
Finally, I'm going to throw another wrinkle into the mix. Some test systems can handle inputs and your assertions coming from a data file. Considering our first one we could have a data file that has a line for each parameter combination we want to test. In this scenario, we only need 1 test case to cover 3 (or more..) possible conditions.
The test case might look like (pseudo code):
read input file.
parse expected result, parameter 1, parameter 2
run method
assert method result = parsed result
repeat for each line of the file
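A concrete version of that pseudo code might look like the following (Java for illustration, with a plain check instead of a specific assertion framework; the cases.csv file and its expected,a,b layout are assumptions, not from the question):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    // Data-driven test sketch: one test routine, any number of cases in a file.
    final class DataDrivenAddNumbersTest {
        static int addNumbers(int a, int b) {
            return a + b;                                   // method under test
        }

        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Paths.get("cases.csv"));
            for (String line : lines) {                     // repeat for each line of the file
                String[] parts = line.split(",");           // parse expected result, parameter 1, parameter 2
                int expected = Integer.parseInt(parts[0].trim());
                int a = Integer.parseInt(parts[1].trim());
                int b = Integer.parseInt(parts[2].trim());
                int result = addNumbers(a, b);              // run method
                if (result != expected) {                   // assert method result = parsed result
                    throw new AssertionError(line + " -> got " + result);
                }
            }
        }
    }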
With that capability, we are down to 1 test case per scenario. I would say 1 per method, but the reality is that most methods are rarely standalone and it's entirely possible that numerous methods are implicitly tested through explicit testing of others; therefore not requiring their own individual tests.
This leads me to this: It is impossible to determine the right number of test cases without a full understanding of your code base. 5 cases that are at the UI level might be enough for complete coverage depending on the complexity of the tests; or it might take thousands. Therefore it's much better to base it on code coverage. What percentage of the code, and branching logic, are you testing?
If I asked a car salesman for a rough price of a car and he just gave me a price, I wouldn't buy my car there, because he forgot to ask me some important questions: What kind of car do you want? Which extras do you want on the car? Etc.
The same goes for the number of test cases... If a hiring manager asked me that question, I would probably give him the following answer.
#test cases = between #Requirements*2 and #Requirements*infinite (some requirements can lead to billions of possibilities)
I would also say that, based on my experience, the number would realistically be #Requirements*5 (that is the number I use in the initial phase, for projects with new, changed and omitted functionality),
where the following error margin has to be applied depending on the phase in which I am making the estimate:
Initiation phase : error margins = 400%
...
Testing phase : error margin = 10%
By the time you start the testing phase, detailed requirements/specs are available, the volatility of requirements has stabilized, requirements creep is almost zero, etc.
At that time I will also be able to give better estimates...