How computationally expensive is `exp`?

I am currently attending a lecture course on automatic speech recognition (ASR). The last lecture was about vector quantization (VQ) and k-nearest neighbors (kNN), as well as binary trees and Gaussian mixture models (GMMs).
According to the lecturer, VQ is used to speed up the evaluation of GMMs by computing only an approximate value of the GMM. This is done by finding the Gaussian in the GMM that would have the highest value and looking up the value for this vector (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians. According to the lecturer, this should speed up the calculation, because evaluating the e-function (exp, the natural exponential function) is computationally expensive.
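To make sure I understood the scheme, here is a rough sketch of what I think is meant; all names, shapes and numbers below are mine, not the lecturer's. The exact GMM evaluation needs one exp per component, while the shortcut only takes a max over per-component log-scores (which in a real decoder would come from the VQ codebook / tree lookup instead of scoring everything):

import numpy as np

def gmm_loglik_full(x, weights, means, variances):
    # Exact log-likelihood of a diagonal-covariance GMM:
    # log sum_k w_k * N(x | mu_k, sigma_k^2)  -> one exp per component.
    diff = x - means                                        # (K, D)
    exponents = -0.5 * np.sum(diff**2 / variances, axis=1)  # (K,)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    return np.log(np.sum(weights * np.exp(log_norm + exponents)))

def gmm_loglik_best(x, weights, means, variances):
    # Approximation: keep only the best-scoring component, no exp / log-sum needed.
    diff = x - means
    exponents = -0.5 * np.sum(diff**2 / variances, axis=1)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    return np.max(np.log(weights) + log_norm + exponents)

K, D = 42, 39                       # ~42 components, ~39-dimensional features
rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(K))
mu = rng.normal(size=(K, D))
var = rng.uniform(0.5, 1.5, size=(K, D))
x = rng.normal(size=D)
print(gmm_loglik_full(x, w, mu, var), gmm_loglik_best(x, w, mu, var))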
I was curious whether this is (still) true, searched for the Python implementation, and found this answer, which explains that exp is computed by the hardware.
Today's CPUs (and GPUs) are complex and I have very limited knowledge of them. It could still be true that exp is much more expensive than, e.g., float comparisons, additions or multiplications.
Questions
How expensive is exp in comparison to float comparisons, additions, multiplications and similar basic operations?
Or did I perhaps misunderstand why VQ is done in ASR?
Experimental evaluation
I tried to get a result by running an experiment. But it is difficult for me to keep other effects from skewing my numbers (e.g. caches, variable lookup times, the cost of the random number generator, ...).
Currently, I have:
#!/usr/bin/env python
import math
import time
import random

# Experiment settings
numbers = 5000000
seed = 0
repetitions = 10

# Experiment: two lists of random floats in [-5, 5]
random.seed(seed)
values = [random.uniform(-5, 5) for _ in range(numbers)]
v2 = [random.uniform(-5, 5) for _ in range(numbers)]

# Exp
for i in range(repetitions):
    t0 = time.time()
    ret = [math.exp(x) for x in values]
    t1 = time.time()
    time_delta = t1 - t0
    print("Exp time: %0.4fs (%0.4f per second)" % (time_delta, numbers / time_delta))

# Addition (x + y)
for i in range(repetitions):
    t0 = time.time()
    ret = [x + y for x, y in zip(values, v2)]
    t1 = time.time()
    time_delta = t1 - t0
    print("x+y time: %0.4fs (%0.4f per second)" % (time_delta, numbers / time_delta))
But I guess the zip call skews this comparison (a less confounded variant is sketched after the timings below), because the result is:
Exp time: 1.3640s (3665573.5997 per second)
Exp time: 1.7404s (2872978.6149 per second)
Exp time: 1.5441s (3238178.6480 per second)
Exp time: 1.5161s (3297876.5227 per second)
Exp time: 1.9912s (2511009.5658 per second)
Exp time: 1.3086s (3820818.9478 per second)
Exp time: 1.4770s (3385254.5642 per second)
Exp time: 1.5179s (3294040.1828 per second)
Exp time: 1.3198s (3788392.1744 per second)
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
x+y time: 2.2017s (2270981.5582 per second)
x+y time: 2.0781s (2406097.0233 per second)
x+y time: 2.1386s (2338005.6240 per second)
x+y time: 1.9963s (2504681.1570 per second)
x+y time: 2.1617s (2313042.3523 per second)
x+y time: 2.3166s (2158293.4313 per second)
x+y time: 2.2966s (2177155.9497 per second)
x+y time: 2.2939s (2179730.8867 per second)
x+y time: 2.3094s (2165055.9488 per second)
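A less confounded variant would be to let timeit repeat a single operation on pre-bound floats, so that list building, zip and indexing stay out of the measured loop. This is only a sketch of the idea; the numbers still mostly reflect interpreter and call overhead rather than raw hardware cost:

import math
import timeit

x, y = 1.2345, -0.6789
n = 10**6

# Time one operation per iteration; operands are pre-bound via globals.
t_exp = timeit.timeit("math.exp(x)", globals={"math": math, "x": x}, number=n)
t_add = timeit.timeit("x + y", globals={"x": x, "y": y}, number=n)
t_cmp = timeit.timeit("x < y", globals={"x": x, "y": y}, number=n)

print("exp: %.3fs  add: %.3fs  cmp: %.3fs (%d calls each)" % (t_exp, t_add, t_cmp, n))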

According to the lecturer, VQ is used to speed up the evaluation of GMMs by computing only an approximate value of the GMM. This is done by finding the Gaussian in the GMM that would have the highest value and looking up the value for this vector (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians.
This is a correct description. You can find an interesting description of an optimized Gaussian computation in the following paper (see the likelihood computation section):
George Saon, Daniel Povey & Geoffrey Zweig, "Anatomy of an extremely fast LVCSR decoder," Interspeech 2005.
http://www.danielpovey.com/files/eurospeech05_george_decoder.pdf
According to the lecturer, this should speed up the calculation, because evaluating the e-function (exp, the natural exponential function) is computationally expensive.
At this point you probably misunderstood the lecturer. The exp itself is not a very significant issue. The Gaussian computation is expensive for other reasons: several thousand Gaussians are scored every frame, each with a few dozen components, and each component is about 40 floats. It is expensive to process all this data because of the amount of memory you have to stream and store. Gaussian selection helps here by reducing the number of Gaussians severalfold and thus speeds up the computation.
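A back-of-envelope sketch makes this concrete. The counts below are assumptions in the same ballpark as the numbers above, not measurements from a particular system:

# Assumed, illustrative numbers: a few thousand Gaussians per frame,
# a few dozen components each, ~40-dimensional features, float32 mean + variance.
n_gaussians = 4000
n_components = 32
dim = 40
bytes_per_value = 4

params_per_frame = n_gaussians * n_components * dim * 2   # means + variances
bytes_per_frame = params_per_frame * bytes_per_value
frames_per_second = 100                                    # 10 ms frames

print("%.1f MB of model parameters touched per frame" % (bytes_per_frame / 1e6))
print("%.1f GB/s just to stream the model in real time"
      % (bytes_per_frame * frames_per_second / 1e9))
# Each component needs roughly 2*dim multiply-adds but only one exp (or log-add),
# so the exp calls are a small fraction of the total work.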
Using a GPU is another solution to this problem. By moving the scoring to the GPU you can speed it up significantly. However, there is an issue with the HMM search in that it cannot be easily parallelized. The search is another important part of decoding, and even if you reduce the scoring time to zero, decoding will still be slow because of it.
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
This is not a meaningful comparison. There are many things you ignored here, like the cost of the Python zip call (izip is better). This way you can easily demonstrate any result you like.
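If you remove the per-element Python overhead, for example by working on whole NumPy arrays, the difference between exp and a plain addition becomes visible, while both remain small compared to the memory traffic discussed above. This is a sketch only; the exact ratio depends on the CPU and the NumPy build:

import numpy as np
import timeit

rng = np.random.default_rng(0)
a = rng.uniform(-5, 5, 5_000_000)
b = rng.uniform(-5, 5, 5_000_000)

# Whole-array operations amortise interpreter overhead away.
t_exp = timeit.timeit(lambda: np.exp(a), number=10)
t_add = timeit.timeit(lambda: a + b, number=10)
print("np.exp: %.3fs   a + b: %.3fs   ratio: %.1fx" % (t_exp, t_add, t_exp / t_add))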

Related

Estimation of Execution Time based on GFLOPS and Time Complexity

I have a CPU with 83.2 GFLOPS and 4 cores, so I understand that each core does (83.2 / 4) = 20.8 GFLOPS.
What I am trying to do is estimate the execution time of an algorithm. I found that we can estimate the execution time roughly by using the following formula:
estimated_exec_time = algorithm_time_complexity / GFLOPS
So if we have a bubble_sort algorithm with time complexity O(n^2) that runs on a VM using 1 core of my CPU, the estimated execution time would be:
estimated_exec_time = n^2 / 20.8 GFLOPS
The problem is that this estimate is completely different from the real execution time when I time my code.
To be more specific, the formula gives an estimate of 0.00004807 s,
while the real execution time is 0.74258 s.
Is this approach with the formula wrong?
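For reference, the quoted figure appears to correspond to n = 1000; that value is an inference from the numbers above, not something stated in the question. A quick check reproduces the estimate and shows why it is so far off: O(n^2) counts abstract steps (comparisons, swaps, interpreter and memory overhead), not floating-point operations executed at peak throughput.

# n = 1000 is an assumption inferred from the quoted 0.00004807 s figure.
n = 1000
peak_flops = 20.8e9                        # one core, from the question

naive_estimate = n**2 / peak_flops
print("naive estimate: %.8f s" % naive_estimate)              # ~0.000048 s
print("real / estimate: %.0fx" % (0.74258 / naive_estimate))  # about 15000x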

kNN-DTW time complexity

I found from various online sources that the time complexity for DTW is quadratic. On the other hand, I also found that standard kNN has linear time complexity. However, when pairing them together, does kNN-DTW have quadratic or cubic time?
In essence, does the time complexity of kNN solely depend on the metric used? I have not found any clear answer for this.
You need to be careful here. Say you have n time series in your 'training' set (let's call it that, even though you are not really training with kNN), each of length l. Computing the DTW between a pair of time series has an asymptotic complexity of O(l * m), where m is your maximum warping window. Since m <= l, O(l^2) also holds. (There are more efficient formulations, but I don't think they are actually faster in practice in most cases; see here.) Classifying a time series with kNN requires computing the distance between that series and every time series in the training set, which means n distance computations, i.e. linear in n.
So your final complexity is O(l * m * n), or O(l^2 * n). In words: the complexity is quadratic in the time series length and linear in the number of training examples.
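To make the factors concrete, here is a hypothetical sketch of 1-NN with a windowed DTW distance; all names and shapes are made up for illustration. The inner DTW is O(l * m) per pair, and the outer loop over the n training series contributes the linear factor:

import numpy as np

def dtw_distance(a, b, window):
    # Windowed DTW between 1-D series a and b: O(len(a) * window).
    la, lb = len(a), len(b)
    D = np.full((la + 1, lb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(max(1, i - window), min(lb, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[la, lb]

def knn_dtw_classify(query, train_series, train_labels, window):
    # 1-NN: n DTW computations -> O(n * l * m) overall.
    dists = [dtw_distance(query, s, window) for s in train_series]
    return train_labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)
train = [rng.normal(size=50) for _ in range(20)]   # n = 20 series of length l = 50
labels = [i % 2 for i in range(20)]
print(knn_dtw_classify(rng.normal(size=50), train, labels, window=10))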

Time complexity of sequential operation with different parameters

I have a function that consists of two sequential operations.
I computed the time complexities of these as:
O(n) + O(k n^(1-1/k))
What is the total time complexity of the function? Is it correct to say that it is O(n + k n^(1-1/k))?

Computational complexity depending on two variables

I have an algorithm that is mainly composed of k-NN, followed by a computation involving finding permutations, followed by some for loops. Line by line, my computational complexity is:
O(n) - for k-NN
O(2^k) - for a part that computes singlets, pairs, triplets, etc.
O(k!) - for a part that deals with combinatorics.
O(k*k!) - for the final part.
k here is a parameter that can be chosen by the user; in general it is somewhat small (10-100). n is the number of examples in my dataset, and this can get very large.
What is the overall complexity of my algorithm? Is it simply O(n) ?
Since k <= 100, f(k) = O(1) for every function f that depends only on k.
In your case, there is such a function f for which the overall running time is O(n + f(k)), so it is O(n).
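To get a feel for how large that k-dependent term can be even though it does not grow with n, a quick check with k = 100 (the upper end of the stated range):

import math

k = 100
print("2^k    is about 10^%d" % round(math.log10(2**k)))
print("k!     is about 10^%d" % round(math.log10(math.factorial(k))))
print("k * k! is about 10^%d" % round(math.log10(k * math.factorial(k))))
# The sum O(n + 2^k + k! + k*k!) collapses to O(n) only because these terms,
# however huge, are independent of the dataset size n.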

How can I speed up application of Log(x+1) to a sparse array in Julia

A sparse matrix in Julia only stores nonzero elements.
Some functions, such as log(x+1) (in all bases),
map zero to zero, and thus don't need to be applied to those zero elements.
(I think we would call this a Monoid homomorphism.)
How can I use this fact to speed up an operation?
Example code:
X = sprand(10^4, 10^4, 10.0^-5, rand)

function naiveLog2p1(N::SparseMatrixCSC{Float64,Int64})
    log2(1 + N) |> sparse   # 1 + N materialises a dense 10^4 x 10^4 matrix
end
Running:
@time naiveLog2p1(X)
Output is:
elapsed time: 2.580125482 seconds (2289 MB allocated, 6.86% gc time in 3 pauses with 0 full sweep)
Running it a second time (so that the function is already compiled):
elapsed time: 2.499118888 seconds (2288 MB allocated, 8.17% gc time in 3 pauses with 0 full sweep)
Little change, presumably because it is so simple to compile.
As per the suggestion of the Julia manual on "Sparse matrix operations", I would extract the nonzero entries using findnz(), do the log operations on the values, and then reconstruct the sparse matrix with sparse().
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    I, J, V = findnz(N)               # row indices, column indices, stored values
    return sparse(I, J, log2(1 + V))
end
@time improvedLog2p1(X)
elapsed time: 0.000553508 seconds (473288 bytes allocated)
My solution would be to actually operate on the inside of the data structure itself:
mysparselog(N::SparseMatrixCSC) =
    SparseMatrixCSC(N.m, N.n, copy(N.colptr), copy(N.rowval), log2(1 + N.nzval))
Note that if you want to operate on the sparse matrix in place, which I imagine would be fairly common in practice, this would be a zero-memory operation. Benchmarking reveals it performs similarly to the @Oxinabox answer, as it does about the same amount of memory operations (although that answer doesn't actually return the new matrix, as shown by the mean output):
with warmup times removed:
naiveLog2p1
elapsed time: 1.902405905 seconds (2424151464 bytes allocated, 10.35% gc time)
mean(M) => 0.005568094618997372
mysparselog
elapsed time: 0.022551705 seconds (24071168 bytes allocated)
elapsed time: 0.025841895 seconds (24071168 bytes allocated)
mean(M) => 0.005568094618997372
improvedLog2p1
elapsed time: 0.018682775 seconds (32068160 bytes allocated)
elapsed time: 0.027129497 seconds (32068160 bytes allocated)
mean(M) => 0.004995127985160583
What you are looking for is the sparse nonzeros function.
nonzeros(A)
Return a vector of the structural nonzero values in sparse matrix A. This includes zeros that are explicitly stored in the sparse matrix. The returned vector points directly to the internal nonzero storage of A, and any modifications to the returned vector will mutate A as well.
You can use this as below:
function improvedLog2p1(N::SparseMatrixCSC{Float64,Int64})
    M = copy(N)
    ms = nonzeros(M)                # points directly at M's stored values
    for i in 1:length(ms)
        ms[i] = log2(1 + ms[i])     # writing through ms mutates M as well
    end
    M
end
@time improvedLog2p1(X)
Running it for the first time, the output is:
elapsed time: 0.002447847 seconds (157 kB allocated)
Running it a second time, the output is:
0.000102335 seconds (109 kB allocated)
That is an improvement of about four orders of magnitude in both speed and memory use.