Except for speed and resource usage, are there any other criteria on which two algorithms can compete? - testing

I intend to race two algorithms and evaluate them. Ignoring developer hindrances such as complexity and deployment difficulties, are there any other criteria which I can test the algorithms against?
By speed I mean the fastest algorithm to return a successful result.
By resources I mean computational power, memory and storage.
Please note that the algorithms in question are in fact genetic algorithms: specifically, a parallel genetic algorithm running over a distributed network versus a local, non-distributed genetic algorithm, so results will differ from run to run.

Further criteria might be:
- influence of compiler / optimisation flags
- CPU architecture dependence
For speed you should keep in mind that it can vary from run to run; often the first run is the slowest. A measure like the average of the fastest 3 execution times out of 10,000 runs can help.
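A minimal sketch of that measurement in Python (the `run_algorithm` body is a placeholder assumption; substitute a single run of whatever you are benchmarking):

```python
import time

def run_algorithm():
    # Placeholder for the code under test (assumption: swap in one GA run here).
    sum(i * i for i in range(1_000))

def best_of(fn, runs=10_000, keep=3):
    """Time fn() `runs` times and average the `keep` fastest wall-clock times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    fastest = sorted(times)[:keep]
    return sum(fastest) / len(fastest)

print(f"average of 3 fastest runs: {best_of(run_algorithm):.6f} s")
```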

Related

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search on IPC, frequency, and chip architecture will show you this topic has been broached many times. There are many things that can determine the execution speed of a program (without even going into threading at all); the main ones that come to mind:
Instruction set - if one chip has an instruction for multiplication, then a*b is atomic. If not, you will need a lot of atomic instructions to perform the same action - a big difference in speed, which can make even higher-frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested in is IPC*frequency, not just frequency: how many atomic actions you can perform in a second. After accounting for the number of atomic actions needed (see 1), on a single-threaded application this might behave as you expect (x2 this => x2 faster program), though with no guarantees.
And there are a ton of other nuanced technologies that can affect this, like branch prediction, which hit the news in a big way recently. For a complete understanding, a book or course might be a better resource.
So, in general, no. If you are comparing two single-core, same-architecture chips (unlikely), then maybe yes.
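To illustrate the IPC-times-frequency point with made-up numbers (the chips and their IPC values below are purely illustrative assumptions):

```python
def instructions_per_second(frequency_hz, ipc):
    """Very rough throughput model: instructions retired per second = frequency * IPC."""
    return frequency_hz * ipc

# Hypothetical chips: the 1 GHz part retires 3 instructions per cycle,
# the 2 GHz part only 1, so the lower-clocked chip does more work per second.
slow_clock = instructions_per_second(1e9, 3.0)   # 3e9 instructions/s
fast_clock = instructions_per_second(2e9, 1.0)   # 2e9 instructions/s
print(slow_clock / fast_clock)                   # 1.5 -> the 1 GHz chip wins on this naive model
```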

Should I trust BLAS libraries unconditionally to improve performance?

I am working on some project that involves computationally intensive image processing algorithms that involve a lot of steps that could be handled by BLAS libraries (mostly level 1 routines). Since my data is quite large it certainly makes sense to consider using BLAS.
I have seen examples where optimised BLAS libraries offer a tremendous increase in performance (a factor-of-10 speedup for matrix-matrix multiplication is nothing unusual).
Should I apply the BLAS functions whenever possible and trust blindly that they will yield better performance, or should I do a case-by-case analysis and apply BLAS only where it is necessary?
Blindly applying BLAS has the benefit that I save some time now since I don't have to profile my code in detail. On the other hand, carefully analysing each method might give me the best possible performance but I wonder if it is worth spending a few hours now just to gain half a second later when running the software.
A while ago, I read in a book: (1) golden rule of optimization: don't do it; (2) golden rule of optimization (for experts only): don't do it yet. In short, I'd recommend proceeding as follows:
step 1: implement the algorithms in the simplest / most legible way
step 2: measure performances
step 3: if (and only if) performance is not satisfactory, use a profiler to detect the hot spots. They are often not where we think!
step 4: try different alternatives for the hot spots only (measure performances for each alternative)
More specifically about your question: yes, a good implementation of BLAS can make a difference (it may use AVX instruction sets and, for matrix-matrix multiplies, decompose the matrices into blocks in a way that is more cache-friendly), but again, I would not "trust unconditionally" (it depends on the version of BLAS, on the data, on the target machine, etc.), so measuring and comparing performance is absolutely necessary.
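As a concrete example of steps 2 and 4, a rough sketch comparing a textbook triple loop against NumPy's `dot`, which dispatches to whatever BLAS NumPy was built against; the matrix size is an arbitrary assumption and the absolute timings will vary by machine:

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple loop, for comparison only."""
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            out[i, j] = s
    return out

def time_once(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

n = 150  # kept small on purpose; the pure-Python loop is very slow
a, b = np.random.rand(n, n), np.random.rand(n, n)
print("naive triple loop :", time_once(naive_matmul, a, b), "s")
print("BLAS-backed np.dot:", time_once(np.dot, a, b), "s")
```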

Finding computational complexity of genetic algorithm

How can I find the computational complexity of a genetic algorithm? I would appreciate any ideas or examples.
Finding the computational complexity of a genetic algorithm is no different than finding the computational complexity of any other algorithm. You have two options:
Evaluate your code - go through line by line and consider how much time and memory each operation will take.
Time your code (and measure memory usage) with different input sizes to experimentally figure out how it scales (note that this will tell you only average-case complexity, unless you specifically know and test on the best and worst cases).
Some general genetic algorithm specific guidelines:
For a generational genetic algorithm (i.e. one with non-overlapping generations), the time complexity will be at least O(population size * number of generations); see the sketch after these guidelines. Most steady-state genetic algorithms (i.e. those which maintain a single population to which individuals are added and from which they are removed over time) will have a similar time complexity, but this is not guaranteed.
The memory complexity for any genetic algorithm needs to be at least O(population size), but can be much larger.
In many cases, evaluating the fitness function is the expensive step of the computation, and so the run-time of various genetic algorithms is often compared in terms of the number of evaluations they require to find a good solution.
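A minimal sketch of a generational GA loop (the operator names and the toy OneMax usage are assumptions for illustration), just to show where the O(population size * number of generations) count of fitness evaluations comes from:

```python
import random

def run_generational_ga(fitness, make_individual, crossover, mutate,
                        pop_size=100, generations=200):
    """Skeleton generational GA; total fitness calls = pop_size * (generations + 1)."""
    population = [make_individual() for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]          # pop_size evaluations

    for _ in range(generations):                           # number of generations
        new_population = []
        for _ in range(pop_size):                          # population size
            p1, p2 = random.sample(population, 2)          # toy parent selection
            new_population.append(mutate(crossover(p1, p2)))
        population = new_population
        scores = [fitness(ind) for ind in population]      # pop_size evaluations

    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]

if __name__ == "__main__":
    # Toy OneMax problem (an assumption), just to exercise the skeleton.
    L = 20
    best, score = run_generational_ga(
        fitness=sum,
        make_individual=lambda: [random.randint(0, 1) for _ in range(L)],
        crossover=lambda a, b: a[:L // 2] + b[L // 2:],
        mutate=lambda ind: [bit ^ (random.random() < 0.05) for bit in ind],
    )
    print(score, best)
```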

Optimizing a genetic algorithm?

I've been playing with parallel processing of genetic algorithms to improve performance but I was wondering what some other commonly used techniques are to optimize a genetic algorithm?
Since fitness values are frequently recalculated (the diversity of the population decreases as the algorithm runs), a good strategy to improve the performance of a GA is to reduce the time needed to calculate the fitness.
Details depend on implementation, but previously calculated fitness values can often be saved efficiently in a hash table. This kind of optimization can drop computation time significantly (e.g. "Improving Genetic Algorithms Performance by Hashing Fitness Values" by Richard J. Povinelli and Xin Feng reports that applying hashing to a GA can improve performance by over 50% for complex real-world problems).
A key point is collision management: you can simply overwrite the existing element of the hash table or adopt some sort of scheme (e.g. a linear probe).
In the latter case, as collisions mount, the efficiency of the hash table degrades to that of a linear search. When the cumulative number of collisions exceeds the size of the hash table, a rehash should be performed: you have to create a larger hash table and copy the elements from the smaller hash table to the larger one.
The copy step could be omitted: the diversity decreases as the GA runs, so many of the eliminated elements will not be used and the most frequently used chromosome values will be quickly recalculated (the hash table will fill up again with the most used key element values).
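A minimal sketch of such fitness caching, using Python's built-in dict as the hash table (collision handling and resizing are then managed internally, so the overwrite/rehash policy above is abstracted away); `expensive_fitness` is a placeholder assumption:

```python
fitness_cache = {}

def expensive_fitness(chromosome):
    # Placeholder for the real, costly evaluation (assumption).
    return sum(gene * gene for gene in chromosome)

def cached_fitness(chromosome):
    """Look the chromosome up before evaluating; the key must be hashable, hence the tuple."""
    key = tuple(chromosome)
    if key not in fitness_cache:
        fitness_cache[key] = expensive_fitness(key)
    return fitness_cache[key]

print(cached_fitness([1, 0, 1, 1]))  # computed
print(cached_fitness([1, 0, 1, 1]))  # served from the cache
```

If the chromosome representation is already hashable, functools.lru_cache(maxsize=...) gives similar behaviour with automatic eviction of the least recently used entries, which loosely plays the role of the bounded, overwrite-on-collision table described above.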
One thing I have done is to limit the number of fitness calculations. For example, where the landscape is not noisy, i.e. where a recalculation of fitness would give the same answer every time, don't recalculate; simply cache the answer.
Another approach is to use a memory operator. The operator maintains a 'memory' of solutions and ensures that the best solution in that memory is included in the GA population if it is better than the best in the population. The memory is kept up to date with good solutions during the GA run. This approach can reduce the number of fitness calculations required and improve performance (a sketch of the idea follows the links below).
I have examples of some of this stuff here:
http://johnnewcombe.net/blog/gaf-part-8/
http://johnnewcombe.net/blog/gaf-part-3/
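A minimal sketch of the memory-operator idea described above (class and method names are my own assumptions, not taken from the linked posts): keep a small archive of the best solutions seen so far and re-inject the archive's best whenever it beats the current population's best.

```python
class MemoryOperator:
    """Keeps the best solutions seen so far and injects them back into the population."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.memory = []  # list of (fitness, individual) pairs, best first

    def record(self, individual, fitness):
        """Remember a solution, keeping only the `capacity` best."""
        self.memory.append((fitness, individual))
        self.memory.sort(key=lambda pair: pair[0], reverse=True)
        del self.memory[self.capacity:]

    def inject(self, population, fitnesses):
        """If the memory's best beats the population's best, replace the worst member."""
        if not self.memory:
            return
        best_mem_fitness, best_mem_individual = self.memory[0]
        if best_mem_fitness > max(fitnesses):
            worst = min(range(len(population)), key=lambda i: fitnesses[i])
            population[worst] = best_mem_individual
            fitnesses[worst] = best_mem_fitness
```

In use, you would call record() on good individuals each generation and inject() just before or after selection.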
This is a very broad question; I suggest using the R galgo package for this purpose.

FLOPs assigned to sqrt on a GPU to measure performance and global efficiency

In a GPU implementation we need to estimate its performance in terms of GFLOPS. The code is very basic, but my problem is how many FLOPs I should assign to operations such as "sqrt" or "mad": 1 or more?
Besides, I obtain 50 GFLOPS for my code if I count 1 FLOP for these operations, while the theoretical maximum for this GPU is 500 GFLOPS. Expressed as a percentage, that is 10%. In terms of speedup I get 100 times, so I think it is great, but 10% seems a rather low yield. What do you think?
Thanks
The right answer is probably "it depends".
For pure comparative performance between code run on different platforms, I usually count transcendentals, sqrt, mads, as one operation. In that sort of situation, the key performance metric is how long the code takes to run. It is almost impossible to do the comparison any other way - how would you go about comparing the "FLOP" count of a hardware instruction for a transcendental which takes 25 cycles to retire, versus a math library generated stanza of fmad instructions which also takes 25 cycles to complete? Counting instructions or FLOPs becomes meaningless in such a case, both performed the desired operation in the same amount of clock cycles, despite a different apparent FLOP count.
On the other hand, for profiling and performance tuning of a piece of code on given hardware, the FLOP count might be a useful metric to have. In GPUs, it is normal to look at FLOP or IOP count and memory bandwidth utilization to determine where the performance bottleneck of a given code lies. Having those numbers might point you in the direction of useful optimizations.
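If you do want a FLOP-based number for tuning, a minimal sketch of the bookkeeping, where the weight assigned to sqrt or mad is an explicit assumption you report alongside the result (the counts, weights, runtime, and 500 GFLOP/s peak below are hypothetical):

```python
def achieved_gflops(flop_counts, weights, runtime_s):
    """Total weighted FLOPs divided by runtime, in GFLOP/s."""
    total_flops = sum(count * weights[op] for op, count in flop_counts.items())
    return total_flops / runtime_s / 1e9

# Hypothetical kernel: operation counts and weights are assumptions, not measured values.
counts  = {"add": 4e9, "mul": 4e9, "mad": 2e9, "sqrt": 1e9}
weights = {"add": 1, "mul": 1, "mad": 2, "sqrt": 1}  # e.g. count a mad as 2 FLOPs, sqrt as 1
gflops = achieved_gflops(counts, weights, runtime_s=0.3)
print(f"{gflops:.1f} GFLOP/s, {100 * gflops / 500:.1f}% of a 500 GFLOP/s peak")
```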