Estimation of Execution Time based on GFLOPS and Time Complexity

I have a CPU rated at 83.2 GFLOPS with 4 cores, so I understand that each core delivers (83.2 / 4) = 20.8 GFLOPS.
What I am trying to do is estimate the execution time of an algorithm. I found that we can roughly estimate the execution time using the following formula:
estimation_exec_time = algorithm_time_complexity / GFLOPS
So if we have a bubble sort algorithm with time complexity O(n^2) that runs on a VM using 1 core of my CPU, the estimated execution time would be:
estimation_exec_time = n^2 / 20.8 GFLOPS
The problem is that the estimated execution time is completely different from the real execution time I get when timing my code.
To be more specific, the formula returns an estimate of 0.00004807 s, while the real execution time is 0.74258 s.
Is this approach with the formula wrong?
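For concreteness, here is a minimal sketch of the comparison (assuming n = 1000, which gives the quoted ~0.000048 s estimate; the bubble sort and timing code below are my own illustration, not the original benchmark):

# A minimal sketch (assuming n = 1000) comparing the formula's prediction
# against an actual timed bubble sort run on one core.
import random
import time

n = 1000
gflops_per_core = 20.8e9            # 83.2 GFLOPS / 4 cores

predicted = n**2 / gflops_per_core  # the formula: n^2 "operations" / FLOPS
print("predicted: %.8f s" % predicted)

data = [random.random() for _ in range(n)]
t0 = time.time()
for i in range(n):                  # textbook bubble sort: O(n^2) comparisons/swaps
    for j in range(n - i - 1):
        if data[j] > data[j + 1]:
            data[j], data[j + 1] = data[j + 1], data[j]
t1 = time.time()
print("measured:  %.5f s" % (t1 - t0))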

Related

What is the meaning of round complexity in distributed network algorithms

Can someone explain the meaning of time complexity in distributed networking algorithms? The definition given in the DNA book by Pandurangan is as follows:
"In the synchronous model, time is measured by the number of clock ticks called rounds, i.e., processors are said to compute in “lock step”. When running a distributed algorithm, different nodes might take a different number of rounds to finish. In that case, the maximum time needed over all nodes is taken as the time complexity"
Can you explain the above definition with a simple example?
Let's say you want to compute the sum of a really large list (say, 1 billion numbers). To speed things up, you use 4 threads, each computing the sum of 250 million numbers; the partial sums can then be added to get the total. If the time taken for each thread to run is:
thread1 takes 43 seconds
thread2 takes 39 seconds
thread3 takes 40 seconds
thread4 takes 41 seconds
Then you would say that the runtime of this operation is bounded by the thread that takes the longest, in this case 43 seconds. It doesn't matter if the other threads take 2 seconds; the longest task determines the runtime of your algorithm, as the sketch below illustrates.
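Here is a minimal sketch of that idea (it uses processes instead of threads to sidestep Python's GIL for CPU-bound work; the data size and worker count are arbitrary): the wall-clock time of the whole fan-out/fan-in computation tracks the slowest worker, not the sum of all workers.

import time
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each worker sums its chunk and reports how long it took.
    start = time.perf_counter()
    return sum(chunk), time.perf_counter() - start

if __name__ == "__main__":
    data = list(range(4_000_000))              # stand-in for the "really large list"
    chunks = [data[i::4] for i in range(4)]    # 4 workers, one chunk each

    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(partial_sum, chunks))
    wall = time.perf_counter() - t0

    print("total sum       :", sum(s for s, _ in results))
    print("per-worker times:", [round(t, 3) for _, t in results])
    print("wall time %.3f s ~ max worker time %.3f s" % (wall, max(t for _, t in results)))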

An algorithm implemented in O(1) time cannot be made faster (yes or no)? And why?

In my opinion, an algorithm implemented in O(1) time means its running time is constant: it does not depend on the value of n, such as the size of an array or the number of loop iterations. Independent of all these factors, it will always run in constant time, for example 10 steps or 1 step.
Since it performs a constant number of steps, there is no scope to improve its performance or make it faster.
An O(1) algorithm still has room for improvement: instead of lowering the time complexity (no longer achievable), you can lower the constant factor.
Example:
T_1(n) = c_1 * 1 is a lot better than T_2(n) = c_2 * 1 if c_1 << c_2.
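A minimal sketch of that point (the two functions below are hypothetical, chosen only to show two O(1) operations with very different constants):

import timeit

def is_even_fast(n):
    return n % 2 == 0               # a single arithmetic step

def is_even_slow(n):
    # Still O(1): the loop bound is a fixed constant, not a function of n,
    # but the constant factor is far larger.
    for _ in range(1000):
        pass
    return str(n)[-1] in "02468"

print("fast:", timeit.timeit(lambda: is_even_fast(123456789), number=100_000))
print("slow:", timeit.timeit(lambda: is_even_slow(123456789), number=100_000))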

How computationally expensive is `exp`?

I am currently hearing a lecture about automatic speech recognition (ASR). The last lecture was about vector quantization (VQ) and k nearest neighbors (kNN) as well as binary trees and gaussian mixture models (GMMs).
According to the lecturer, VQ is used to speed up the evaluation of GMMs by calculating only an approximate value of the GMM. This is done by finding the Gaussian in the GMM which would have the highest value and looking that value up (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians. According to the lecturer, this should speed the calculation up, because the calculation of the e-function (exp, the natural exponential function) is computationally expensive.
I was curious if this is (still) true, searched for the Python implementation and found this answer which explains that exp is calculated by the hardware.
Today's CPUs (and GPUs) are complex and I have very limited knowledge of them. It could still be true that exp is much more expensive than, e.g., comparisons of floats, additions or multiplications.
Questions
How expensive is exp compared to float comparisons, additions, multiplications and similar basic operations?
Or did I perhaps misunderstand why VQ is done in ASR?
Experimental evaluation
I tried to get a result by running an experiment. But it is difficult for me to prevent other effects (e.g. caches, variable lookup times, the time taken by the random number generator, ...) from skewing my numbers.
Currently, I have
#!/usr/bin/env python
import math
import time
import random

# Experiment settings
numbers = 5000000
seed = 0
repetitions = 10

# Experiment
random.seed(seed)
values = [random.uniform(-5, 5) for _ in range(numbers)]
v2 = [random.uniform(-5, 5) for _ in range(numbers)]

# Exp
for i in range(repetitions):
    t0 = time.time()
    ret = [math.exp(x) for x in values]
    t1 = time.time()
    time_delta = t1 - t0
    print("Exp time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))

# Comparison
for i in range(repetitions):
    t0 = time.time()
    ret = [x+y for x, y in zip(values, v2)]
    t1 = time.time()
    time_delta = t1 - t0
    print("x+y time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))
But I guess zip skews this comparison, because the result is:
Exp time: 1.3640s (3665573.5997 per second)
Exp time: 1.7404s (2872978.6149 per second)
Exp time: 1.5441s (3238178.6480 per second)
Exp time: 1.5161s (3297876.5227 per second)
Exp time: 1.9912s (2511009.5658 per second)
Exp time: 1.3086s (3820818.9478 per second)
Exp time: 1.4770s (3385254.5642 per second)
Exp time: 1.5179s (3294040.1828 per second)
Exp time: 1.3198s (3788392.1744 per second)
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
x+y time: 2.2017s (2270981.5582 per second)
x+y time: 2.0781s (2406097.0233 per second)
x+y time: 2.1386s (2338005.6240 per second)
x+y time: 1.9963s (2504681.1570 per second)
x+y time: 2.1617s (2313042.3523 per second)
x+y time: 2.3166s (2158293.4313 per second)
x+y time: 2.2966s (2177155.9497 per second)
x+y time: 2.2939s (2179730.8867 per second)
x+y time: 2.3094s (2165055.9488 per second)
According to the lecturer, VQ is used to speed up the evaluation of GMMs by calculating only an approximate value of the GMM. This is done by finding the Gaussian in the GMM which would have the highest value and looking that value up (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians.
This is a correct description. You can find an interesting description of an optimized Gaussian computation in the following paper:
George Saon, Daniel Povey & Geoffrey Zweig, "Anatomy of an extremely fast LVCSR decoder," Interspeech 2005.
http://www.danielpovey.com/files/eurospeech05_george_decoder.pdf
See the likelihood computation section.
According to the lecturer, this should speed the calculation up, because the calculation of the e-function (exp, the natural exponential function) is computationally expensive.
At this point you probably misunderstood the lecturer. The exp itself is not a very significant issue. The Gaussian computation is expensive for other reasons: several thousand Gaussians are scored every frame, each with a few dozen components, and each component has around 40 floats. It is expensive to process all this data because of the amount of memory you need to fetch and store. Gaussian selection helps here by reducing the number of Gaussians several-fold and thus speeds up the computation.
Using a GPU is another solution to this problem. By moving scoring to the GPU you can speed it up significantly. However, there is an issue in that the HMM search cannot be easily parallelized. The search is another important part of decoding, and even if you reduce the scoring cost to zero, decoding will still be slow because of it.
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
This is not a meaningful comparison. There are many things ignored here, such as the cost of the Python zip call (in Python 2, itertools.izip is cheaper because it does not build an intermediate list). This way you can demonstrate almost any result.
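As a rough illustration of a fairer measurement (a sketch, not a definitive benchmark), one could pair the operands once outside the timed region and use timeit, so that list construction and zip are not part of what is measured:

# A minimal sketch of a somewhat fairer micro-benchmark: time math.exp against a
# plain float addition on the same pre-generated operands, keeping zip and list
# construction out of the measured region.
import math
import random
import timeit

random.seed(0)
xs = [random.uniform(-5, 5) for _ in range(1_000_000)]
ys = [random.uniform(-5, 5) for _ in range(1_000_000)]
pairs = list(zip(xs, ys))          # pairing done once, outside the timed region

def run_exp():
    for x in xs:
        math.exp(x)

def run_add():
    for x, y in pairs:
        x + y

print("exp :", timeit.timeit(run_exp, number=5))
print("x+y :", timeit.timeit(run_add, number=5))

Even then, most of what gets measured is the Python interpreter's per-iteration overhead rather than the floating-point operation itself, which reinforces the point that exp is not the bottleneck here.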

CPU Scheduling : Finding burst time

In the FCFS scheduling algorithm, the drawback is that if a process P1 with a long burst time arrives before processes P2, P3, ... with much shorter burst times, then the average waiting time and average completion time are quite high.
A solution to this problem is to schedule the shortest job first (the SJF algorithm).
But how is the burst time computed in advance? Does the developer specify a formula by which (according to the resources available) the burst time to perform a job is computed in advance?
Estimating the burst time of a process is a very large topic.
In general, the scheduler estimates the length of the next burst based on the lengths of recent CPU bursts. Basically, we guess the next CPU burst time by assuming it will be related to the past CPU bursts of that process.
A quick Google search led me to this article, which will give you a basic idea.
Here is a more detailed article.
This can be done using an exponential average estimation formula:
estimated_burst(n+1) = alpha * actual_burst(n) + (1 - alpha) * estimated_burst(n)
where:
alpha is a constant with 0 <= alpha <= 1;
actual_burst(n) is the most recent CPU burst time of the process/job;
estimated_burst(n) carries the history of the process/job, i.e. how we previously estimated its CPU burst time.
For the very first execution there is no history (equivalently, alpha = 1), so we have to run the process/job once to obtain an actual burst time.
After that, we can estimate the upcoming CPU burst times, varying alpha to weight recent behaviour against history.
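A minimal sketch of that estimator (the burst values below are made up purely for illustration):

def next_estimate(actual_burst, prev_estimate, alpha=0.5):
    # Exponential average: blend the latest actual burst with the old estimate.
    return alpha * actual_burst + (1 - alpha) * prev_estimate

observed_bursts = [6.0, 4.0, 6.0, 4.0, 13.0, 13.0, 13.0]  # hypothetical history (ms)
estimate = observed_bursts[0]   # bootstrap: the first run supplies the initial value

for burst in observed_bursts[1:]:
    print("predicted %.2f ms, actual %.2f ms" % (estimate, burst))
    estimate = next_estimate(burst, estimate)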

clock() accuracy

I have seen many posts about using the clock() function to determine the amount of elapsed time in a program with code looking something like:
start_time = clock();
//code to be timed
.
.
.
end_time = clock();
elapsed_time = (end_time - start_time) / CLOCKS_PER_SEC;
The value of CLOCKS_PER_SEC is almost surely not the actual number of clock ticks per second, so I am a bit wary of the result. Without worrying about threading and I/O, is the output of the clock() function scaled in some way so that this division produces the correct wall-clock time?
The answer to your question is yes.
clock() in this case refers to a wallclock rather than a CPU clock so it could be misleading at first glance. For all the machines and compilers I've seen, it returns the time in milliseconds since I've never seen a case where CLOCKS_PER_SEC isn't 1000. So the precision of clock() is limited to milliseconds and the accuracy is usually slightly less.
If you're interested in the actual cycles, this can be hard to obtain.
The rdtsc instruction will let you read a "pseudo"-cycle count accumulated since the CPU was booted. On older systems (like the Intel Core 2), this counter usually ticks at the actual CPU frequency, but on newer systems it doesn't.
To get a more accurate timer than clock(), you will need to use hardware performance counters, which are OS-specific. These are internally implemented using the rdtsc instruction from the previous paragraph.
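As an aside, if you just want to see the wall-clock versus CPU-clock distinction in action without dropping to C, Python's standard library exposes both counters; this is only an analogy for illustration, not the C clock() itself:

# A small illustration (in Python, as an analogy) of wall-clock vs CPU-clock time:
# sleeping advances the wall clock but burns almost no CPU time, while a busy loop
# advances both.
import time

w0, c0 = time.perf_counter(), time.process_time()
time.sleep(0.5)                      # wall clock advances, CPU clock barely moves
sum(i * i for i in range(10**6))     # busy work: both clocks advance
w1, c1 = time.perf_counter(), time.process_time()

print("wall-clock elapsed: %.3f s" % (w1 - w0))
print("CPU-time elapsed:   %.3f s" % (c1 - c0))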