What makes non-linear functions computationally expensive in hardware (e.g. FPGA)?

I've read some articles that state non-linear functions (like exponentials) are computationally expensive.
I was wondering what makes them computationally expensive.
When referring to 'computationally expensive' does it mean in terms of time taken or hardware resources used?
I've tried searching on Google, but I couldn't find any simple explanations for this.

I'm not pretending to offer the definitive answer, but start with what you have in an FPGA.
Normally you're limited to adders, multipliers and some memory. What can you do with those?
A linear function is easy, taking just one multiplier and one adder.
Nonlinear functions - what are those? They are either polynomial, requiring you to spend a ton of multipliers (the higher the polynomial's degree, the more of them), or even transcendental, requiring you to find some satisfactory approximation and compute it in many steps.
Even simple integer division can't be done in one clock; simple implementations require as many steps as there are bits in the numbers being divided.
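To make the one-bit-per-step point concrete, here is a minimal software sketch of restoring division (a simulation of the hardware idea, not FPGA code; the 8-bit width is just an example):

    def restoring_divide(dividend, divisor, width=8):
        """Unsigned restoring division: one quotient bit per iteration,
        which is why a simple hardware divider needs 'width' clock cycles."""
        assert divisor != 0
        remainder, quotient = 0, 0
        for i in range(width - 1, -1, -1):            # one iteration per bit
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            if remainder >= divisor:                  # trial subtraction succeeds
                remainder -= divisor
                quotient |= 1 << i
        return quotient, remainder

    print(restoring_divide(200, 7))                   # (28, 4)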
The other possible solution is to use a lookup table, and that's great for a small range of arguments. But if you want the function values over a wide range of arguments, or with greater precision, you'll end up with a lookup table so large that it can't fit in the device you have to work with.
So those are the main costs: you'll spend lots of dedicated hardware resources (multipliers, memory for lookup tables), or spend lots of time in multi-step approximation algorithms, or in algorithms that refine the result one "digit" per iteration (integer division, CORDIC, etc.).
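As a rough illustration of the polynomial-versus-table trade-off for something like exp (a software sketch under simplifying assumptions: floating point instead of fixed point, and an arbitrary table range and resolution):

    import math

    def exp_poly(x, degree=5):
        """Horner evaluation of the Taylor series of exp around 0:
        'degree' multiply-add steps, i.e. that many multipliers or clock cycles."""
        coeffs = [1.0 / math.factorial(k) for k in range(degree, -1, -1)]
        acc = coeffs[0]
        for c in coeffs[1:]:
            acc = acc * x + c                 # one multiply and one add per step
        return acc

    # Lookup table: memory grows with both the argument range and the precision.
    STEP = 1.0 / 256                          # illustrative resolution
    TABLE = [math.exp(i * STEP) for i in range(int(4.0 / STEP) + 1)]   # covers [0, 4]

    def exp_lut(x):
        """Nearest-entry lookup; finer steps or a wider range mean a bigger table."""
        return TABLE[round(x / STEP)]

    print(exp_poly(0.5), exp_lut(0.5), math.exp(0.5))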

Related

When calculating the time complexity of an algorithm can we count the addition of two numbers of any size as requiring 1 "unit" of time or O(1) units?

I am working on analysing the time complexity of an algorithm, and I am not certain of the correct way to account for the time complexity of basic operations such as addition and subtraction of two numbers. I have learnt that the time complexity of adding two n-digit numbers is O(n), because that is how many elementary bit operations you need to perform during the addition. However, I have heard recently that in modern processors the time taken to add two numbers of any size (that is still manageable by a computer) is constant: it does not depend on the size of the two numbers. Hence, in the time complexity analysis of an algorithm, you should count adding two numbers of any size as O(1). Which approach is correct? Or, in case both approaches are "correct" in the appropriate context, which approach is more acceptable in a research paper? Thank you for any answer in advance.
It depends on the kind of algorithm you are analyzing, but for the general case you just assume that the inputs to the algorithm will fit into the word size of the machine it will run on (be that 32 bits, 128 bits, whatever). Under that assumption, any single arithmetic operation will probably be executed as a single machine instruction and computed in a single or small constant number of CPU clock cycles, regardless of the underlying complexity of the hardware implementation, so you treat the complexity of that operation as O(1). That is, you assume O(1) complexity for arithmetic operations unless there is a particular reason to believe they cannot be handled in constant time.
You would only really break the O(1) assumption if you were specifically designing an algorithm for numerical inputs of arbitrary precision, such that you plan to compute arithmetic operations programmatically yourself rather than passing them off entirely to hardware (your algorithm expects overflows/underflows and is designed to handle them), or if you were working down at the level of implementing these operations yourself in an ALU or FPU circuit. Only then does it become relevant to your complexity analysis whether multiplication takes O(n log n) or O(n log n log log n) time in the number of bits, because the number of bits involved is no longer bounded by some constant, or because you are specifically analyzing the complexity of an algorithm or piece of hardware that itself implements an arithmetic operation.
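If it helps to see where the O(n) model comes from, here is a minimal sketch of schoolbook addition on digit arrays, which is essentially what arbitrary-precision arithmetic has to do (word-sized addition, by contrast, is a single machine instruction):

    def add_digits(a, b):
        """Schoolbook addition of two numbers stored as lists of decimal digits,
        least significant digit first: O(n) elementary digit operations."""
        result, carry = [], 0
        for i in range(max(len(a), len(b))):
            da = a[i] if i < len(a) else 0
            db = b[i] if i < len(b) else 0
            s = da + db + carry
            result.append(s % 10)
            carry = s // 10
        if carry:
            result.append(carry)
        return result

    # 478 + 256 = 734, digits stored least-significant first
    print(add_digits([8, 7, 4], [6, 5, 2]))   # [4, 3, 7]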

What is the relationship between time complexity and the number of steps in an algorithm?

For large values of n, an algorithm that takes 20000n^2 steps has better time complexity (takes less time) than one that takes 0.001n^5 steps
I believe this statement is true. But, why?
If there are more steps wouldn't that take more time?
Computational complexity is considered in the asymptotic sense because the important question is usually one of scaling. Even in your clear-cut case, the n^5 algorithm begins to take longer at around 271 items, which isn't very many. See this figure from Wolfram Alpha.
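As a quick numeric check of that crossover (setting 0.001·n^5 = 20000·n^2 gives n^3 = 2×10^7, i.e. n ≈ 271; the sketch below just brute-forces it):

    # Where does 0.001*n^5 overtake 20000*n^2?  Solving gives n^3 = 2e7, n ~ 271.4.
    crossover = next(n for n in range(1, 10**6) if 0.001 * n**5 > 20000 * n**2)
    print(crossover)                                  # 272, the first integer past it
    for n in (100, 272, 1000):
        print(n, 20000 * n**2, 0.001 * n**5)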
Quoting from the wikipedia article linked above:
Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.
All that said, if you have two comparable algorithms, and the one with lower complexity has a large constant coefficient, and you're only going to process 10 items, then it may very well be a good idea to choose the asymptotically less efficient one. Some common libraries even switch algorithms depending upon the size of the data being processed; this is called a hybrid algorithm, and Python's sorted implementation, Timsort, uses the idea to switch between insertion sort and merge sort.
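A toy hybrid in that spirit (the real Timsort is considerably more sophisticated; the threshold here is arbitrary):

    def hybrid_sort(xs, threshold=32):
        """Toy hybrid: insertion sort below the threshold (tiny constant factor),
        merge sort above it (better asymptotic growth)."""
        if len(xs) <= threshold:
            for i in range(1, len(xs)):               # insertion sort, O(n^2) worst case
                x, j = xs[i], i - 1
                while j >= 0 and xs[j] > x:
                    xs[j + 1] = xs[j]
                    j -= 1
                xs[j + 1] = x
            return xs
        mid = len(xs) // 2                            # otherwise merge sort, O(n log n)
        left = hybrid_sort(xs[:mid], threshold)
        right = hybrid_sort(xs[mid:], threshold)
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    print(hybrid_sort([5, 2, 9, 1, 7, 3], threshold=2))   # [1, 2, 3, 5, 7, 9]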

Is Routh-Hurwitz useful when I can just calculate eigenvalues?

This is for self-study of an N-dimensional system of linear homogeneous ordinary differential equations of the form:
dx/dt=Ax
where A is the coefficient matrix of the system.
I have learned that you can check for stability by determining whether the real parts of all the eigenvalues of A are negative, and you can check for oscillations by looking for purely imaginary eigenvalues of A.
The author in the book I'm reading then introduces the Routh-Hurwitz criterion for detecting stability and oscillations of the system. This seems to be a more efficient computational short-cut than calculating eigenvalues.
What are the advantages of using Routh-Hurwitz criteria for stability and oscillations, when you can just find the eigenvalues quickly nowadays? For instance, will it be useful when I start to study nonlinear dynamics? Is there some additional use that I am completely missing?
The Wikipedia entry on Routh-Hurwitz stability analysis has material about control systems and ends up with a lot of equations in the s-domain (Laplace transforms), but for my applications I will be staying in the time domain for the most part, focusing fairly narrowly on stability and oscillations in linear (or linearized) systems.
My motivation: it seems easy to calculate eigenvalues on my computer, and the Routh-Hurwitz criterion comes off as sort of anachronistic, the sort of thing that might save me a lot of time if I were doing this by hand, but not very helpful for doing analysis of small-fry systems via Matlab.
Edit: I've asked this at Math Exchange, which seems more appropriate:
https://math.stackexchange.com/questions/690634/use-of-routh-hurwitz-if-you-have-the-eigenvalues
There is an accepted answer there.
This is just legacy educational curriculum that fell way behind the actual computational age. Routh-Hurwitz gives a very nice theoretical basis for the parametrization of root positions and is linked to much more abstract math.
However, for control purposes it is just a nice trick with no practical value, except maybe for simple transfer functions with one or two unknown parameters. It had real value when computing the roots of polynomials was expensive or even manual. Today, even polynomial root finding is based on forming the companion matrix and computing its eigenvalues. In fact, you can basically form a meshgrid over the parameters and check the stability surface by plotting the largest real part, all in a few minutes.
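For example, with NumPy (the matrix and the polynomial below are purely illustrative; np.roots itself works by forming the companion matrix and taking its eigenvalues):

    import numpy as np

    def is_stable(A):
        """dx/dt = A x is asymptotically stable iff every eigenvalue of A
        has a negative real part."""
        return bool(np.all(np.linalg.eigvals(A).real < 0))

    print(is_stable(np.array([[0.0, 1.0], [-2.0, -3.0]])))    # True (eigenvalues -1, -2)

    # Sweep a parameter k in the polynomial s^2 + k*s + 2 and record the largest
    # real part of its roots: a one-parameter "stability surface".
    for k in np.linspace(-1.0, 3.0, 5):
        roots = np.roots([1.0, k, 2.0])
        print(f"k = {k:+.1f}   max Re(root) = {roots.real.max():+.3f}")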

Is flop per second a measure of the speed of a processor, or a measure of the speed of an algorithm?

1) I can see very clearly that the number of floating-point operations a computer can do in one second is a good way of quantifying its performance. That's correct, right?
2) My teacher keeps asking me to calculate the flop rate for algorithms I program. I do this by counting how many flops the algorithm performs and timing how long it takes to run. In this situation the flop rate always falls way short of the flop rate I expect from the computer I'm using. So for algorithms, is the flop rate more an assessment of how long the 'other stuff' takes (i.e. overheads, work that doesn't involve flops)? That is, when the flop count is low, most of the program's time is spent calling functions etc. and not performing flops, correct?
I know this is a very broad question but I was hoping for some ideas from those in industry or academia about what they intuitively feel the flop rate of an algorithm actually is.
Properly, “flops” is a measure of processor or system performance. Many people misuse it as a measure of implementation or algorithm speed.
Suppose you had a computation to perform that is fixed in the number of operations it takes. For example, you want to multiply a matrix with dimensions a•b by a matrix with dimensions b•c. If you perform this multiplication in the usual way, then for each combination of one of the a rows and one of the c columns, you perform b multiplications and b-1 additions. So the entire matrix multiplication takes a•c•(2b-1) floating-point operations. If it finishes in one second, some people say it is providing a•c•(2b-1) flops.
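That calculation is easy to do yourself; here is a minimal sketch, assuming NumPy is available and with purely illustrative matrix sizes:

    import time
    import numpy as np

    a, b, c = 512, 512, 512
    X, Y = np.random.rand(a, b), np.random.rand(b, c)

    start = time.perf_counter()
    Z = X @ Y
    elapsed = time.perf_counter() - start

    nominal_ops = a * c * (2 * b - 1)     # b multiplications + (b-1) additions per entry
    print(f"{nominal_ops / elapsed / 1e9:.2f} nominal GFLOPS for this run")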
If you have two programs that both do the multiplication the same way, you can compare them using this figure. The one of them that has more “flops” is better. Even though they use the same algorithm, one of them might have a better implementation, perhaps because it organizes the work more efficiently for memory cache.
This breaks when somebody figures out a new algorithm that gets the same job done with fewer operations. Then some people compare programs (or routines) using the nominal number of operations of the original method, even though the program actually performs fewer operations.
To some extent, this makes sense. If you have two programs that do the same job, and one of them has a higher number of “flops” calculated this way, then it is the program that gives you the answer more quickly.
However, it does not make sense to the extent that it introduces inaccuracy. We are often not interested in a single problem size but in various sizes, and the “flops” of a program will not scale linearly with the nominal number of operations once a new algorithm is used.
By analogy, suppose it is 80 kilometers from town A to town B over the mountain road that everybody uses. If it takes your car an hour to make the trip, your car is traveling 80 kilometers an hour. While out exploring one day, you discover a pass through the mountains that reduces the trip to 70 kilometers. Now you can make the trip in 52.5 minutes. The same calculation that some people do with “flops” would say your car is going 91.4 kilometers per hour, since it makes the 80-kilometer trip in 52.5 minutes.
That is obviously wrong. However, it is useful for deciding which route to take.
FLOPS means the number of Floating-Point Operations Per Second executed by a processor. That can be a purely theoretical figure derived from some hardware/architecture specification, or an empirical result from running some algorithm that is tuned to give high numbers.
The main issue in FLOPS calculation arises in systems with multiple, parallel execution blocks. AFAIK, only in that context does it start to get really tough to split a practical algorithm (e.g. FFT, or RGB->YUV conversion) into the most useful set of instructions that use all the calculation units in a CPU. (E.g. without automatic vectorization, an x64 system often performs floating-point operations only in the Xmm0[0] register, wasting 50-75% of the full potential.)
This partly answers question 2. Besides the obvious stall introduced by cache/memory-to-register bandwidth, the next crucial obstacle on the way to maximum FLOPS figures is that the data is in the wrong register. That is something often completely ignored in complexity analysis, which, just like FLOPS calculations, only counts basic arithmetic operations. In parallel programming it often happens that there is not just one, but 4, 8 or 16 values in the wrong registers, with no means of easily permuting them all at once. Add to that the overhead and the "warm up" and "cool down" stages of an algorithm that tries to occupy all the calculating units with meaningful data, and there you have the major reasons for getting 100 MFLOPS out of a 1 GFLOPS system.

Difference between Gene Expression Programming and Cartesian Genetic Programming

Something pretty annoying in evolutionary computing is that mildly different and overlapping concepts tend to end up with dramatically different names. My latest confusion because of this is that gene expression programming (GEP) seems very similar to Cartesian genetic programming (CGP).
(how) Are these fundamentally different concepts?
I've read that indirect encoding of GP instructions is an effective technique (both GEP and CGP do that). Has some sort of consensus been reached that indirect encoding has made classic tree-based GP obsolete?
Well, it seems that there is some difference between gene expression programming (GEP) and cartesian genetic programming (CGP or what I view as classic genetic programming), but the difference might be more hyped up than it really ought to be. Please note that I have never used GEP, so all of my comments are based on my experience with CGP.
In CGP there is no distinction between genotype and phenotype; in other words, if you're looking at the "genes" of a CGP you're also looking at their expression. There is no encoding here, i.e. the expression tree is the gene itself.
In GEP the genotype is expressed into a phenotype, so if you're looking at the genes you will not readily know what the expression is going to look like. The "inventor" of GEP, Cândida Ferreira, has written a really good paper, and there are some other resources which try to give a shorter overview of the whole concept.
Ferreira says that the benefits are "obvious," but I really don't see anything that would necessarily make GEP better than CGP. Apparently GEP is multigenic, which means that multiple genes are involved in the expression of a trait (i.e. an expression tree). In any case, the fitness is calculated on the expressed tree, so it doesn't seem like GEP is doing anything to increase the fitness. What the author claims is that GEP increases the speed at which the fitness is reached (i.e. in fewer generations), but frankly speaking you can see dramatic performance shifts in a CGP just by having a different selection algorithm, a different tournament structure, splitting the population into tribes, migrating specimens between tribes, including diversity in the fitness, etc.
Selection:
random
roulette wheel
top-n
take half
etc.
Tournament Frequency:
once per epoch
once per data instance
once per generation
Tournament Structure:
Take 3, kill 1 and replace it with the child of the other two.
Sort all individuals in the tournament by fitness, kill the lower half and replace it with the offspring of the upper half (where lower is worse fitness and upper is better fitness).
Randomly pick individuals from the tournament to mate and kill the excess individuals.
Tribes
A population can be split into tribes that evolve independently of each other:
Migration: periodically, individual(s) from one tribe are moved to another tribe
The tribes are logically separated so that they're like their own separate populations running in separate environments.
Diversity Fitness
Incorporate diversity into the fitness: count how many individuals have the same fitness value (and thus are likely to have the same phenotype) and penalize their fitness by a proportionate amount; the more individuals with the same fitness value, the greater the penalty for those individuals. This way specimens with unique phenotypes are encouraged, and there is much less stagnation of the population.
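A minimal sketch of such a penalty (the penalty constant and the assumption that higher fitness is better are just for illustration):

    from collections import Counter

    def diversity_adjusted_fitness(population_fitness, penalty=0.1):
        """Penalize individuals that share a fitness value with many others:
        the more duplicates, the larger the penalty (higher fitness = better)."""
        counts = Counter(population_fitness)
        return [f - penalty * (counts[f] - 1) for f in population_fitness]

    print(diversity_adjusted_fitness([0.9, 0.9, 0.9, 0.7, 0.5]))
    # The three duplicated 0.9s are pushed down to ~0.7; unique values are untouched.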
Those are just some of the things that can greatly affect the performance of a CGP, and when I say greatly I mean an effect of the same order as, or greater than, Ferreira's reported gains. So if Ferreira didn't tinker with those ideas too much, then she could have seen much slower performance from the CGPs... especially if she didn't do anything to combat stagnation. So I would be careful when reading performance statistics on GEP, because sometimes people fail to account for all of the "optimizations" available out there.
There seems to be some confusion in these answers that must be clarified. Cartesian GP is different from classic GP (aka tree-based GP), and GEP. Even though they share many concepts and take inspiration from the same biological mechanisms, the representation of the individuals (the solutions) varies.
In CGP the representation (the mapping between genotype and phenotype) is indirect; in other words, not all of the genes in a CGP genome will be expressed in the phenome (a concept also found in GEP and many others). The genotypes can be coded in a grid or array of nodes, and the resulting program graph is the expression of active nodes only.
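A minimal sketch of that genotype-to-phenotype mapping (the genome layout here is a simplified illustration, not any particular CGP library's format):

    import operator

    FUNCS = {0: operator.add, 1: operator.sub, 2: operator.mul}

    # Each node gene: (function_id, input_a, input_b). Indices 0..n_inputs-1 refer
    # to the program inputs; later indices refer to earlier nodes.
    genome = [(0, 0, 1),   # node 2: x0 + x1
              (2, 0, 0),   # node 3: x0 * x0   (never referenced, so inactive)
              (2, 2, 1)]   # node 4: node2 * x1
    output_gene = 4        # the phenotype is whatever the output gene points to

    def active_nodes(genome, output_gene, n_inputs=2):
        """Trace back from the output; any node not reached stays unexpressed."""
        active, stack = set(), [output_gene]
        while stack:
            idx = stack.pop()
            if idx >= n_inputs and idx not in active:
                active.add(idx)
                _, a, b = genome[idx - n_inputs]
                stack += [a, b]
        return sorted(active)

    print(active_nodes(genome, output_gene))   # [2, 4]; node 3 is not expressed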
In GEP the representation is also indirect, and similarly not all genes will be expressed in the phenotype. The representation in this case is much different from tree GP or CGP, but the genotypes are likewise expressed into a program tree. In my opinion GEP is a more elegant representation and easier to implement, but it also suffers from some defects: you have to find the appropriate tail and head sizes, which is problem specific; the multigenic version is a bit of a forced glue between expression trees; and finally it has too much bloat.
Independently of which representation may be better than the other in some specific problem domain, they are general purpose and can be applied to any domain as long as you can encode it.
In general, GEP is simpler than GP. Let's say you allow the following nodes in your program: constants, variables, +, -, *, /, if, ...
For each such node, with GP you must create the following operations:
- randomize
- mutate
- crossover
- and probably other genetic operators as well
In GEP, for each such node only one operation needs to be implemented: deserialize, which takes an array of numbers (like double in C or Java) and returns the node. It resembles object deserialization in languages like Java or Python (the difference is that deserialization in programming languages uses byte arrays, whereas here we have arrays of numbers). Even this 'deserialize' operation doesn't have to be implemented by the programmer: it can be provided by a generic algorithm, just as it is in Java or Python deserialization.
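A sketch of the idea (the node kinds, the two-numbers-per-node layout and the names are all hypothetical, just to show that genetic operators only ever need to touch plain number arrays):

    import random

    # Node kinds with their arities; everything about a node is derived from
    # plain numbers, so randomize/mutate/crossover operate on number arrays only.
    NODE_KINDS = [("const", 0), ("var", 0), ("add", 2), ("mul", 2)]

    def deserialize_node(genes):
        """Turn a slice of the numeric chromosome into a node description."""
        kind, arity = NODE_KINDS[int(genes[0]) % len(NODE_KINDS)]
        return {"kind": kind, "arity": arity, "payload": genes[1]}

    chromosome = [random.uniform(0, 10) for _ in range(6)]
    nodes = [deserialize_node(chromosome[i:i + 2]) for i in range(0, 6, 2)]
    print(nodes)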
From one point of view this simplicity may make the search for the best solution less successful, but on the other hand it requires less work from the programmer, and simpler algorithms may execute faster (they are easier to optimize, more code and data fit in the CPU cache, and so on). So I would say that GEP is slightly better, but of course the definite answer depends on the problem, and for many problems the opposite may be true.