Is flop per second a measure of the speed of a processor, or a measure of the speed of an algorithm? - optimization

1) It seems clear to me that the number of floating-point operations a computer can do in one second is a good way of quantifying its performance. That's correct, right?
2) My teacher keeps asking me to calculate the flop rate for algorithms I program. I do this by counting how many flops the algorithm performs and timing how long it takes to run. The flop rate I measure this way always falls way short of the flop rate I expect from the computer I'm using. So for algorithms, is the flop rate more an assessment of how long the 'other stuff' takes (i.e. overhead, work that doesn't involve flops)? That is, when the flop rate is low, most of the program's time is spent calling functions etc. and not performing flops, correct?
I know this is a very broad question, but I was hoping for some ideas from those in industry or academia about what they intuitively feel the flop rate of an algorithm actually is.

Properly, “flops” is a measure of processor or system performance. Many people misuse it as a measure of implementation or algorithm speed.
Suppose you had a computation to perform that is fixed in the number of operations it takes. For example, you want to multiply an a×b matrix by a b×c matrix. If you perform this multiplication in the usual way, then, for each of the a·c combinations of one row and one column, you perform b multiplications and b−1 additions. So the entire matrix multiplication takes a·c·(2b−1) floating-point operations. If it finishes in one second, some people say it is providing a·c·(2b−1) flops.
If you have two programs that both do the multiplication the same way, you can compare them using this figure. The one of them that has more “flops” is better. Even though they use the same algorithm, one of them might have a better implementation, perhaps because it organizes the work more efficiently for memory cache.
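A minimal sketch of that comparison, assuming square matrices and the nominal count from above (the size is arbitrary): two loop orderings of the same algorithm, timed and converted to a "flops" figure. The cache-friendlier ordering typically reports the higher number even though the arithmetic is identical.

```c
/* Sketch: same algorithm, same nominal flop count a*c*(2b-1), two loop
   orderings. The ikj ordering streams through B and C row-by-row, which
   is friendlier to the cache, so it usually reports higher "flops". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 512   /* arbitrary example size: a = b = c = N */

static double gflops(double ops, double seconds) {
    return ops / seconds / 1e9;
}

int main(void) {
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = malloc(sizeof(double) * N * N);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    double ops = (double)N * N * (2.0 * N - 1.0);   /* nominal a*c*(2b-1) */

    /* ijk: inner loop strides down a column of B -> poor cache behaviour. */
    memset(C, 0, sizeof(double) * N * N);
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
    double t_ijk = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* ikj: inner loop walks rows of B and C contiguously -> better reuse. */
    memset(C, 0, sizeof(double) * N * N);
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += r * B[k * N + j];
        }
    double t_ikj = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("ijk: %.3f s -> %.2f GFLOPS\n", t_ijk, gflops(ops, t_ijk));
    printf("ikj: %.3f s -> %.2f GFLOPS\n", t_ikj, gflops(ops, t_ikj));
    free(A); free(B); free(C);
    return 0;
}
```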
This breaks when somebody figures out a new algorithm that gets the same job done with fewer operations. Then some people compare programs (or routines) using the nominal number of operations of the original method, even though the program actually performs fewer operations.
To some extent, this makes sense. If you have two programs that do the same job, and one of them has a higher number of “flops” calculated this way, then it is the program that gives you the answer more quickly.
However, it does not make sense to the extent that it introduces inaccuracy. We are often not interested in a single problem size but in various sizes, and the “flops” of a program will not scale linearly with the nominal number of operations once a new algorithm is used.
By analogy, suppose it is 80 kilometers from town A to town B over the mountain road that everybody uses. If it takes your car an hour to make the trip, your car is traveling 80 kilometers an hour. While out exploring one day, you discover a pass through the mountains that reduces the trip to 70 kilometers. Now you can make the trip in 52.5 minutes. The same calculation that some people do with “flops” would say your car is going 91.4 kilometers per hour, since it makes the 80-kilometer trip in 52.5 minutes.
That is obviously wrong. However, it is useful for deciding which route to take.

FLOPS means the number of Floating Point Operations Per Second executed by a processor. That can be a purely theoretical figure derived from some hardware/architecture specification, or an empirical result from running some algorithm that is tuned to give high numbers.
The main difficulty in FLOPS calculation arises on systems with multiple, parallel execution units. AFAIK, it is only in that context that it gets really tough to map a practical algorithm (e.g. an FFT, or RGB->YUV conversion) onto the set of instructions that keeps all the calculation units in a CPU busy. (For example, without automatic vectorization an x64 system often performs floating-point operations only in the lowest lane of an XMM register, wasting 50-75% of the full potential.)
This partly answers question 2. Besides the obvious stall introduced by cache/memory-to-register bandwidth, the next crucial obstacle on the way to maximum FLOPS figures is that the data is in the wrong register. That is something often completely ignored in complexity analysis, which, just like FLOPS calculations, only counts basic arithmetic operations. With parallel programming it often happens that there is not just one, but 4, 8 or 16 values in the wrong registers, with no way of easily permuting them all at once. Add to that the overhead and the "warm up" and "cool down" stages of an algorithm that tries to keep all the calculating units fed with meaningful data, and there you have the major reasons for getting 100 MFLOPS out of a 1 GFLOPS system.
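As a rough illustration of that gap between achieved and peak rates, here is a plain scalar dot product timed the same way as above. The peak figure is a made-up placeholder you would replace with your CPU's spec-sheet number, and whether the compiler vectorizes the loop (e.g. with -O3 -march=native) changes the achieved figure dramatically.

```c
/* Sketch: achieved FLOPS of a plain dot product vs an assumed peak.
   PEAK_GFLOPS is a hypothetical placeholder; substitute your CPU's figure. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PEAK_GFLOPS 100.0   /* hypothetical single-core peak, for illustration */

int main(void) {
    const size_t n = 1u << 24;                 /* 16M elements */
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t t0 = clock();
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];                      /* 2 flops per iteration */
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double gflops = 2.0 * (double)n / seconds / 1e9;
    printf("sum=%g, %.2f GFLOPS achieved, %.1f%% of assumed peak\n",
           s, gflops, 100.0 * gflops / PEAK_GFLOPS);

    free(x); free(y);
    return 0;
}
```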

Related

When calculating the time complexity of an algorithm can we count the addition of two numbers of any size as requiring 1 "unit" of time or O(1) units?

I am working on analysing the time complexity of an algorithm. I am not certain what the correct way is to account for the time complexity of basic operations such as addition and subtraction of two numbers. I have learnt that the time complexity of adding two n-digit numbers is O(n), because this is how many elementary bit operations you need to perform during the addition. However, I have heard recently that nowadays, in modern processors, the time taken to add two numbers of any size (which is still manageable by a computer) is constant: it does not depend on the size of the two numbers. Hence, in the time complexity analysis of an algorithm, you should count adding two numbers of any size as O(1). Which approach is correct? Or, if both approaches are "correct" in the appropriate context, which approach is more acceptable in a research paper? Thank you for any answer in advance.
It depends on the kind of algorithm you are analyzing, but in the general case you simply assume that the inputs to the algorithm will fit into the word size of the machine it runs on (be that 32 bits, 128 bits, whatever). Under that assumption, any single arithmetic operation will probably be executed as a single machine instruction and complete in one or a small constant number of CPU clock cycles, regardless of the underlying complexity of the hardware implementation, so you treat the complexity of that operation as O(1). That is, you assume O(1) complexity for arithmetic operations unless there is a particular reason to believe they cannot be handled in constant time.
You would only really break the O(1) assumption if you were specifically designing an algorithm to run on numerical inputs of arbitrary precision, such that you plan to compute the arithmetic operations programmatically yourself rather than handing them off entirely to hardware (your algorithm expects overflows/underflows and is designed to handle them), or if you were working down at the level of implementing these operations yourself in an ALU or FPU circuit. Then, whether multiplication is performed in O(n log n) or O(n log n log log n) time in the number of bits actually becomes relevant to your complexity analysis, because the number of bits involved in these operations is not bounded by a constant, or because you are specifically analyzing the complexity of an algorithm or piece of hardware that itself implements an arithmetic operation.
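To make the two cost models concrete, here is a toy sketch (the representation and names are made up for illustration): a word-sized add that maps to a single machine instruction versus a schoolbook digit-by-digit add whose loop length grows with the number of digits.

```c
/* Sketch of the two cost models: a word-sized add is one machine
   instruction (the O(1) model), while adding two n-digit numbers stored
   as digit arrays takes a loop of length n (the O(n) model). */
#include <stdio.h>
#include <stdint.h>

/* O(1) model: fits in a machine word, one add instruction. */
static uint64_t add_word(uint64_t a, uint64_t b) {
    return a + b;
}

/* O(n) model: little-endian base-10 digit arrays, schoolbook carry chain.
   'out' must have room for n+1 digits; returns the result length. */
static int add_digits(const int *a, const int *b, int n, int *out) {
    int carry = 0;
    for (int i = 0; i < n; i++) {          /* n elementary digit additions */
        int s = a[i] + b[i] + carry;
        out[i] = s % 10;
        carry = s / 10;
    }
    out[n] = carry;
    return carry ? n + 1 : n;
}

int main(void) {
    printf("%llu\n", (unsigned long long)add_word(123456789ULL, 987654321ULL));

    int a[4] = {9, 9, 9, 9}, b[4] = {1, 0, 0, 0}, out[5];  /* 9999 + 1 */
    int len = add_digits(a, b, 4, out);
    for (int i = len - 1; i >= 0; i--) printf("%d", out[i]);
    printf("\n");                           /* prints 10000 */
    return 0;
}
```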

What is the relationship between time complexity and the number of steps in an algorithm?

For large values of n, an algorithm that takes 20000n^2 steps has better time complexity (takes less time) than one that takes 0.001n^5 steps
I believe this statement is true. But, why?
If there are more steps wouldn't that take more time?
Computational complexity is considered in the asymptotic sense because the important question is usually one of scaling. Even in your clear-cut case, the n^5 algorithm begins to take longer at around 271 items, which isn't very many: setting 20000n^2 = 0.001n^5 gives n^3 = 2*10^7, i.e. n ≈ 271. (Plotting both curves, e.g. on Wolfram Alpha, makes this easy to see.)
Quoting from the wikipedia article linked above:
Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.
All that said, if you have two comparable algorithms, the one with better complexity has a significant constant coefficient, and you're only going to process 10 items, then it may very well be a good idea to choose the asymptotically worse one. Some common libraries even switch algorithms depending on the size of the data being processed; this is called a hybrid algorithm, and Timsort, Python's sorted implementation, uses the idea to switch between insertion sort and merge sort.
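A minimal sketch of that idea (the cutoff of 16 is arbitrary; real libraries tune it, and Timsort itself is more involved): recurse with a divide-and-conquer sort, but fall back to insertion sort below a small threshold where its low constant factor wins.

```c
/* Sketch of a hybrid sort: divide-and-conquer (quicksort here) for large
   ranges, insertion sort below a small cutoff where its low constant
   factor beats the asymptotically better algorithm. Cutoff is arbitrary. */
#include <stdio.h>

#define CUTOFF 16

static void insertion_sort(int *a, int lo, int hi) {
    for (int i = lo + 1; i <= hi; i++) {
        int key = a[i], j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

static void hybrid_sort(int *a, int lo, int hi) {
    if (hi - lo + 1 <= CUTOFF) {            /* small range: insertion sort */
        insertion_sort(a, lo, hi);
        return;
    }
    int pivot = a[hi], i = lo;              /* Lomuto partition */
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    hybrid_sort(a, lo, i - 1);
    hybrid_sort(a, i + 1, hi);
}

int main(void) {
    int a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
    hybrid_sort(a, 0, 9);
    for (int i = 0; i < 10; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```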

What makes non linear functions computationally expensive in hardware (e.g. FPGA)?

I've read some articles that state non-linear functions (like exponentials) are computationally expensive.
I was wondering what makes them computationally expensive.
When referring to 'computationally expensive' does it mean in terms of time taken or hardware resources used?
I've tried searching on Google, but I couldn't find any simple explanations for this.
I'm not pretending to offer the definitive answer, but start with what you have in an FPGA.
Normally you're limited to adders, multipliers and some memory. What can you do with those?
Linear function - easy, taking just one multiplier and one adder.
Nonlinear functions - what are those? Either polynomials, which require you to spend a ton of multipliers (the more, the higher the polynomial's degree), or even transcendental functions, which require you to find some satisfactory approximation and compute it in many steps.
Even simple integer division can't be done in one clock cycle; simple implementations need as many steps as there are bits in the numbers being divided.
The other possible solution is to use a lookup table, and that's great for a small range of arguments. But if you want the function values over a wide range of arguments, or with greater precision, you'll end up with a lookup table so large that it can't fit in the device you have to work with.
So those are the main costs: you either spend lots of dedicated hardware resources (multipliers, memory for lookup tables), or spend lots of time in multi-step approximation algorithms, or in algorithms that refine the result one "digit" per iteration (integer division, CORDIC, etc.).
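To make the trade-off concrete, here is a software sketch of the two usual strategies for something like e^x on [0,1]: a fixed-degree polynomial evaluated with Horner's rule (one multiply and one add per degree, which on an FPGA maps to that many multiplier stages), versus a small lookup table with linear interpolation (memory instead of multipliers). The degree, coefficients and table size are illustrative, not a production design.

```c
/* Sketch of two ways to approximate a nonlinear function (e^x on [0,1])
   without calling the hardware math unit, mirroring the FPGA trade-off:
   polynomial -> multipliers, lookup table -> memory. Values illustrative. */
#include <stdio.h>
#include <math.h>

/* Degree-4 Taylor polynomial of e^x, evaluated with Horner's rule:
   4 multiplies + 4 adds, i.e. one multiplier "stage" per degree. */
static double exp_poly(double x) {
    return 1.0 + x * (1.0 + x * (0.5 + x * (1.0 / 6.0 + x * (1.0 / 24.0))));
}

/* Small lookup table with linear interpolation: trades multipliers for
   memory; precision is limited by the table size. */
#define TABLE_SIZE 17
static double table[TABLE_SIZE];

static void init_table(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = exp((double)i / (TABLE_SIZE - 1));
}

static double exp_lut(double x) {                 /* x assumed in [0,1] */
    double pos = x * (TABLE_SIZE - 1);
    int idx = (int)pos;
    if (idx >= TABLE_SIZE - 1) return table[TABLE_SIZE - 1];
    double frac = pos - idx;
    return table[idx] + frac * (table[idx + 1] - table[idx]);
}

int main(void) {
    init_table();
    for (double x = 0.0; x <= 1.0; x += 0.25)
        printf("x=%.2f  exact=%.6f  poly=%.6f  lut=%.6f\n",
               x, exp(x), exp_poly(x), exp_lut(x));
    return 0;
}
```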

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search on IPC, frequency, and chip architecture will show you this topic has been broached many times. There are many things that can determine the execution speed of a program (without even going into threading at all); the main ones that come to mind:
Instruction set - If one chip has an instruction for multiplication, then a*b is atomic. If not, you will need a lot of atomic instructions to perform such an action - a big difference in speed, which can make even higher-frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested in is IPC*frequency, not just frequency: how many atomic actions you can perform in a second. After accounting for the number of atomic actions needed (see 1), on a single-threaded application this might behave as you expect (2x this => 2x faster program), though with no guarantees; a back-of-envelope sketch follows below.
And there are a ton of other nuances that can affect this, like branch prediction, which hit the news in a big way recently. For a complete understanding, a book or course might be a better resource.
So, in general, no. If you are comparing two single-core chips of the same architecture (unlikely), then maybe yes.
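Here is that back-of-envelope sketch (all numbers are hypothetical): the naive estimate is time ≈ instructions / (IPC × frequency), so a chip with half the frequency only ties the prediction if it retires twice as many instructions per cycle, and even then caches, instruction set, and branch behaviour can swamp the estimate.

```c
/* Back-of-envelope sketch: estimate runtime as instructions / (IPC * freq).
   All figures are hypothetical; caches, instruction set and branches are
   ignored, so treat this as an illustration rather than a prediction. */
#include <stdio.h>

static double estimate_seconds(double instructions, double ipc, double freq_hz) {
    return instructions / (ipc * freq_hz);
}

int main(void) {
    double work = 7.2e12;   /* hypothetical instruction count for the problem */

    /* 2 GHz chip retiring 1 instruction per cycle -> the 1-hour baseline... */
    printf("2 GHz, IPC 1: %.2f hours\n",
           estimate_seconds(work, 1.0, 2.0e9) / 3600.0);

    /* ...vs a 1 GHz chip: it only ties the naive prediction if its IPC is 2. */
    printf("1 GHz, IPC 1: %.2f hours\n",
           estimate_seconds(work, 1.0, 1.0e9) / 3600.0);
    printf("1 GHz, IPC 2: %.2f hours\n",
           estimate_seconds(work, 2.0, 1.0e9) / 3600.0);
    return 0;
}
```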

FLOPS assigned to sqrt in GPU to measure performance and global efficiency

In a GPU implementation we need to estimate its performance in terms of GFLOPS. The code is very basic, but my problem is how many FLOPs I should assign to operations like "sqrt" or "mad" - 1 or more.
Besides, I obtain 50 GFLOPS for my code if I count 1 FLOP for these operations, while the theoretical maximum for this GPU is 500 GFLOPS. Expressed as a percentage I get 10%. In terms of speedup I get 100 times. So I think it is great, but 10% seems to be a bit of a low yield; what do you think?
Thanks
The right answer is probably "it depends".
For pure comparative performance between code run on different platforms, I usually count transcendentals, sqrt, and mads as one operation each. In that sort of situation, the key performance metric is how long the code takes to run. It is almost impossible to do the comparison any other way - how would you go about comparing the "FLOP" count of a hardware instruction for a transcendental which takes 25 cycles to retire, versus a math-library-generated stanza of fmad instructions which also takes 25 cycles to complete? Counting instructions or FLOPs becomes meaningless in such a case: both performed the desired operation in the same number of clock cycles, despite a different apparent FLOP count.
On the other hand, for profiling and performance tuning of a piece of code on given hardware, the FLOP count might be a useful metric to have. In GPUs, it is normal to look at FLOP or IOP count and memory bandwidth utilization to determine where the performance bottleneck of a given code lies. Having those numbers might point you in the direction of useful optimizations.
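As a small illustration of how much the counting convention matters, here is a sketch in which the same hypothetical kernel timing is credited with different per-instruction weights; the operation counts, the timing, and the weight of 8 flops for a sqrt expansion are all made-up placeholders.

```c
/* Sketch: the "achieved GFLOPS" figure depends entirely on how many flops
   you credit to instructions like sqrt or mad. Counts and timing are
   made-up placeholders for a hypothetical kernel. */
#include <stdio.h>

int main(void) {
    double n_sqrt = 1.0e9;        /* hypothetical sqrt operations in the kernel */
    double n_mad  = 4.0e9;        /* hypothetical mad operations */
    double seconds = 0.12;        /* hypothetical measured kernel time */

    /* Convention A: every instruction counts as 1 flop. */
    double flops_a = (n_sqrt * 1.0 + n_mad * 1.0) / seconds;

    /* Convention B: mad = 2 flops (mul + add), sqrt = some larger weight,
       e.g. the length of an equivalent fmad expansion. */
    double flops_b = (n_sqrt * 8.0 + n_mad * 2.0) / seconds;

    printf("convention A: %.1f GFLOPS\n", flops_a / 1e9);
    printf("convention B: %.1f GFLOPS\n", flops_b / 1e9);
    return 0;
}
```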