Comparison of Space and Time Complexity in Big-O between kNN (Nearest Neighbor) and any Classifier? - time-complexity

Thoughts about kNN:
For a case base used in kNN with n examples and d features per example, the time complexity is O(n x k x d). This is the worst-case number of computations needed.
Space complexity: I would say it is O((n+1) x k x d); the plus one is for the query.
Thoughts about a Classifier:
It uses zero neighbours for classification and does not need to keep examples in memory, so n=0 and k=0. The feature dimension d remains.
In addition, we have parameters for the classifier. If it is a single fully connected neural network layer (feedforward / MLP) with o output neurons and a softmax layer, we could say:
Time complexity: O(d x o + o)
where d is the dimension of the input features and o that of the output, resulting in d x o multiplications. The added o is for the softmax layer.
Space complexity: O(d x o + o) since I have to load only one example with d features into my memory and the weights of the network.
So since o << k x n, we can say that a classifier's space and time complexity is smaller than for using kNN. Right?
Are these thoughts correct?
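To make the comparison concrete, here is a minimal Python sketch of both predictors, with the cost of each step noted in comments. It assumes brute-force kNN (no KD-tree or other index) and a single dense layer with softmax; the function names and the toy data are hypothetical. Under these assumptions, one kNN query costs roughly O(n x d) for the distances plus the selection of the k nearest, and has to keep all n x d training values in memory, while the dense layer costs O(d x o) in both time and space.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Brute-force kNN for one query. X_train has shape (n, d)."""
    # n squared distances, each over d features: O(n * d) time.
    dists = np.sum((X_train - query) ** 2, axis=1)
    # Select the k smallest distances without a full sort: roughly linear in n.
    nearest = np.argpartition(dists, k - 1)[:k]
    # Majority vote over k labels: O(k) up to sorting the k labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def linear_softmax_predict(W, b, x):
    """One dense layer with o outputs. W has shape (o, d)."""
    # d * o multiplications: O(d * o) time, and W itself takes O(d * o) space.
    logits = W @ x + b
    # Softmax over o outputs: O(o). Subtracting the max avoids overflow.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs
```

Note that k only enters the kNN cost through the selection and the vote, not as a factor multiplying n x d.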

Related

Fast algorithm for computing cofactor matrix

I wonder if there is a fast algorithm, say O(n^3), for computing the cofactor matrix (or conjugate matrix) of an n x n square matrix. And yes, one could first compute its determinant and inverse separately and then multiply them together. But what if this square matrix is non-invertible?
I am curious about the accepted answer here: Speed up python code for computing matrix cofactors
What is meant by "This probably means that also for non-invertible matrixes, there is some clever way to calculate the cofactor (i.e., not use the mathematical formula that you use above, but some other equivalent definition)."?
Factorize M = L x D x U, where L is lower triangular with ones on the main diagonal, U is upper triangular with ones on the main diagonal, and D is diagonal.
You can use back-substitution as with Cholesky factorization, which is similar. Then,
M^{ -1 } = U^{ -1 } x D^{ -1 } x L^{ -1 }
and then the transposed cofactor matrix is:
Cof( M )^T = Det( U ) x Det( D ) x Det( L ) x M^{ -1 }.
If M is singular or nearly so, one element (or more) of D will be zero or nearly zero. Replace those elements with zero in the matrix product and 1 in the determinant, and use the above equation for the transpose cofactor matrix.
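As a cross-check, here is a minimal NumPy sketch. It does not implement the L x D x U recipe above: for the singular case it swaps in an SVD-based route, which rests on the same idea (factor M, then take the adjugate of a diagonal factor, which stays well defined when some diagonal entries are zero). All function names are hypothetical.

```python
import numpy as np

def cofactor_by_minors(M):
    """Textbook definition via minors; far slower than O(n^3), for checking only."""
    n = M.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(M, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

def cofactor_via_inverse(M):
    """Invertible M only: Cof(M)^T = Det(M) x M^{-1}."""
    return (np.linalg.det(M) * np.linalg.inv(M)).T

def cofactor_via_svd(M):
    """Also works for singular M. With M = U diag(s) Vt:
    Cof(M) = Det(U) Det(Vt) x U x diag(g) x Vt, where g_i = prod_{j != i} s_j."""
    U, s, Vt = np.linalg.svd(M)
    # Products of all singular values except the i-th; no division by s_i needed.
    g = np.array([np.prod(np.delete(s, i)) for i in range(len(s))])
    sign = np.linalg.det(U) * np.linalg.det(Vt)  # +1 or -1 for orthogonal factors
    return sign * (U * g) @ Vt
```

The SVD itself is O(n^3); the loop that builds g is O(n^2) and could be replaced by prefix and suffix products if that ever matters.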

How to find time-varying coefficients for a VAR model by using the Kalman Filter

I'm trying to write some code in R to reproduce the model I found in this article.
The idea is to model the signal as a VAR model, but fit the coefficients by a Kalman-filter model. This would essentially enable me to create a robust time-varying VAR(p) model and analyze non-stationary data to a degree.
The model to track the coefficients is:
X(t) = F(t) X(t− 1) +W(t)
Y(t) = H(t) X(t) + E(t),
where H(t) is the Kronecker product between lagged measurements in my time-series Y and a unit vector, and X(t) fills the role of regression-coefficients. F(t) is taken to be an identity matrix, as that should mean we assume coefficients to evolve as a random walk.
In the article, from W(t), the state noise covariance matrix Q(t) is initially set to 10^-3 and then fitted by an iteration scheme. From E(t), the measurement noise covariance matrix R(t) is replaced by the covariance of the noise term left unexplained by the model: Y(t) - H(t)Xhat(t)
I have the a priori covariance matrix of estimation error (denoted Σ in the article) written as P (based on other sources) and the a posteriori as Pmin, since it will be used in the next recursion as a priori, if that makes sense.
So far I've written the following, based on the article's Appendix A 1.2
Y <- *my timeseries, for test purposes two channels of 3000 points*
F <- diag(8) # F is (m^2*p by m^2 *p) where m=2 dimensions and p =2 lags
H <- diag(2) %x% t(vec(Y[,1:2])) #the Kronecker product of vectorized lags Y-1 and Y-2
Xhatminus <- matrix(1,8,1) # an arbitrary *a priori* coefficient matrix
Q <- diag(8)%x%(10**-7) #just a number a really low number diagonal matrix, found it used in some examples
R<- 1 #Didnt know what else to put here just yet
Pmin = diag(8) #*a priori* error estimate, just some 1-s...
Now the recursion should start. To test, I just took the first 3000 points of one trial of my data.
Xhatstorage <- matrix(0,8,3000)
for(j in 3:3000){
H <- diag(2) %x% t(vec(Y[,(j-2):(j-1)]))
K <- (Pmin %*% t(H)) %*% solve((H%*%Pmin%*%t(H) + R)) ##Solve gives inverse matrix ()^-1
P <- Pmin - K%*% H %*% Pmin
Xhatplus <- F%*%( Xhatminus + K%*%(Y[,j]-H%*%Xhatminus) )
Pplus <- (F%*% P %*% F) + Q
Xhatminus <- Xhatplus
Xhatstorage[,j] <- Xhatplus
Pmin <- Pplus
}
I extracted Xhatplus values into a storage matrix and used them to write this primitive VAR model with them:
Yhat<-array(0,3000)
for(t in 3:3000){
Yhat[t]<- t(vec(Y[,(t-2)])) %*% Xhatstorage[c(1,3),t] + t(vec(Y[,(t-1)])) %*% Xhatstorage[c(2,4),t]
}
The result looks like this.
The blue line is the VAR with the coefficients found by the Kalman filter; black is the original data.
I'm having trouble understanding how I can better evaluate my coefficients. Why is it so off?
How should I choose the first a priori and a posteriori estimates to start the recursion? Adding more lags to the VAR is not the issue, I'm sure; it's that I don't know how to choose the initial values for Pmin and Xhatminus. Most of the sources I pieced this together from start from arbitrary 0 assumptions in toy models, but in this case, choosing any of these matrices as 0 just collapses the entire algorithm.
Lastly, is this recursion even a correct implementation of what Oya et al. describe in the article? I know I'm still missing the R evaluation based on previously unexplained errors (V(t) in Appendix A 1.2), but in general?
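For comparison, here is a minimal NumPy sketch of the standard Kalman recursion for the random-walk coefficient model stated above (F = identity, H(t) built from lagged measurements). It is not the article's exact scheme: Q and R are held fixed as q x I and r x I, a full m x m matrix is used for R, and the initial state is zero with covariance p0 x I. The function name, the parameters q, r, p0 and the (T, m) layout of Y are assumptions made for this sketch.

```python
import numpy as np

def kalman_tv_var(Y, p=2, q=1e-3, r=1.0, p0=1.0):
    """Track VAR(p) coefficients as a random walk. Y has shape (T, m).
    Returns an array of shape (T, m*m*p) with the coefficient estimate at each t."""
    T, m = Y.shape
    n = m * m * p
    x = np.zeros(n)            # coefficient estimate
    P = p0 * np.eye(n)         # its error covariance
    Q = q * np.eye(n)          # state noise covariance (fixed here)
    R = r * np.eye(m)          # measurement noise covariance (fixed here)
    coeffs = np.zeros((T, n))
    for t in range(p, T):
        # Stacked lags [y_{t-1}, ..., y_{t-p}], length m*p.
        z = Y[t - p:t][::-1].ravel()
        H = np.kron(np.eye(m), z[None, :])        # shape (m, n)
        # Predict: with F = I the state is unchanged and P grows by Q.
        P = P + Q
        # Update with the measurement y_t.
        S = H @ P @ H.T + R
        K = np.linalg.solve(S, H @ P).T           # K = P H^T S^{-1}
        x = x + K @ (Y[t] - H @ x)
        P = P - K @ H @ P
        coeffs[t] = x
    return coeffs
```

A quick sanity check is to simulate a VAR with constant, known coefficients and confirm that `coeffs[-1].reshape(m, m * p)` settles near them before moving to real data.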

Stable sampling of large Gaussian distributions

I'm trying to sample a Gaussian distribution of covariance matrix P that is N by N, with N very large (around 4000 ).
Usually one would proceed like so:
Compute the Cholesky decomposition of P : L, such that L * L.T = P
Sample a normal Gaussian distribution : X ~N(0,I_N), where I_N is the identity and N = 4000
Obtain the desired sample Y from Y = L * X
The snag here is in the computation of L. The algorithm does not seem to be stable for such a large matrix, as the computed Cholesky decomposition does not satisfy L * L.T = P.
I've tried to normalize P before computing its Cholesky decomposition (dividing it by its largest value), to no avail. I'm using the C++ library Eigen, and I've noticed this problem with numpy as well.
Any advice?
Cholesky decomposition should be quite stable if the input matrix is actually positive definite. It can have issues if the matrix is (nearly) semidefinite or indefinite.
In that case you can use the LDLT decomposition instead. For an input A it computes a permutation P, a unit-diagonal triangular L and a diagonal D, such that
A = P.T*L*D*L.T*P
Then instead of multiplying Y = L * X you need Y = P.T * L * sqrt(D) * X, where sqrt(D) is an element-wise sqrt of the diagonal (I don't know the python syntax for that).
Note that the permutation cannot simply be dropped here: it acts on the left of L, so without P.T the sample has covariance L*D*L.T = P*A*P.T rather than A. (This also requires all entries of D to be non-negative.)
If that still does not work, try using the SelfAdjointEigenSolver-decomposition.
This computes a diagonal matrix of eigenvalues D and a unitary matrix V of eigenvectors, such that
A = V * D * V^{-1}
And you can do essentially the same as above. (Note that for unitary matrices, V^{-1} is just the adjoint of V, i.e., V^{-1} = V^T in the real-valued case).
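The eigendecomposition route can be sketched in NumPy as follows. This is an illustration under assumptions, not a drop-in fix: the function names and the tolerance `tol` are hypothetical, and eigenvalues that come out slightly negative from rounding are clipped to zero, which is only reasonable if the matrix really is a covariance matrix up to round-off.

```python
import numpy as np

def sqrt_factor(P, tol=1e-10):
    """Return B with B @ B.T == P (up to rounding) for symmetric PSD P."""
    P = 0.5 * (P + P.T)                  # remove any asymmetry from round-off
    w, V = np.linalg.eigh(P)             # P = V diag(w) V.T, w ascending
    if w.min() < -tol * max(w.max(), 1.0):
        raise ValueError("matrix has a significantly negative eigenvalue")
    w = np.clip(w, 0.0, None)            # clip tiny negative eigenvalues to zero
    return V * np.sqrt(w)                # same as V @ diag(sqrt(w))

def sample_gaussian(P, n_samples, rng):
    """Draw n_samples columns from N(0, P)."""
    B = sqrt_factor(P)
    X = rng.standard_normal((P.shape[0], n_samples))
    return B @ X
```

Unlike Cholesky, this does not fail when P is rank deficient; the price is that eigh is noticeably slower than a Cholesky factorization for N in the thousands.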

Big O notation and measuring time according to it

Suppose we have an algorithm that is of order O(2^n). Furthermore, suppose we multiplied the input size n by 2 so now we have an input of size 2n. How is the time affected? Do we look at the problem as if the original time was 2^n and now it became 2^(2n) so the answer would be that the new time is the power of 2 of the previous time?
Big O is not for telling you the actual running time, just how the running time is affected by the size of the input. If you double the size of the input, the complexity is still O(2^n); n is just bigger.
number of elements (n)    units of work (2^n)
1                         2
2                         4
3                         8
4                         16
5                         32
...                       ...
10                        1024
20                        1048576
There's a misunderstanding here about how Big-O relates to execution time.
Consider the following formulas which define execution time:
f1(n) = 2^n + 5000n^2 + 12300
f2(n) = (500 * 2^n) + 6
f3(n) = 500n^2 + 25000n + 456000
f4(n) = 400000000
Each of these functions is O(2^n); that is, each can be shown to be at most M * 2^n for some constant M and all n beyond some starting value n0. But obviously, the change in execution time you notice when doubling the size from n1 to 2 * n1 will vary wildly between them (not at all in the case of f4(n)). You cannot use Big-O analysis alone to determine effects on execution time. It only defines an upper bound on the execution time (which is not even guaranteed to be the tightest upper bound).
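The effect of doubling can be checked numerically. The small Python sketch below defines the four functions exactly as written above; the helper `doubling_ratio` is a hypothetical name. For a running time of exactly 2^n, doubling the input gives 2^(2n) = (2^n)^2, so the time is squared; for the other functions, which are also O(2^n), the ratio behaves very differently.

```python
def f1(n):
    return 2**n + 5000 * n**2 + 12300

def f2(n):
    return 500 * 2**n + 6

def f3(n):
    return 500 * n**2 + 25000 * n + 456000

def f4(n):
    return 400000000

def doubling_ratio(f, n):
    """How many times longer f takes when the input size goes from n to 2n."""
    return f(2 * n) / f(n)

if __name__ == "__main__":
    for f in (f1, f2, f3, f4):
        print(f.__name__, doubling_ratio(f, 10))
```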
Some related academia below:
There are three notable bounding functions in this category:
O(f(n)): Big-O - This defines an upper bound.
Ω(f(n)): Big-Omega - This defines a lower-bound.
Θ(f(n)): Big-Theta - This defines a tight-bound.
A given time function f(n) is Θ(g(n)) only if it is also Ω(g(n)) and O(g(n)) (that is, both upper and lower bounded).
You are dealing with Big-O, which is the usual "entry point" to the discussion; we will neglect the other two entirely.
Consider the definition from Wikipedia:
Let f and g be two functions defined on some subset of the real numbers. One writes:
f(x)=O(g(x)) as x tends to infinity
if and only if there is a positive constant M such that for all sufficiently large values of x, the absolute value of f(x) is at most M multiplied by the absolute value of g(x). That is, f(x) = O(g(x)) if and only if there exists a positive real number M and a real number x0 such that
|f(x)| <= M|g(x)| for all x > x0
Going from here, assume we have f1(n) = 2^n. If we were to compare that to f2(n) = 2^(2n) = 4^n, how would f1(n) and f2(n) relate to each other in Big-O terms?
Is 2^n <= M * 4^n for some constant M and all n > n0? Of course! Using M = 1 and n0 = 1, it is true. Thus, 2^n is O(4^n).
Is 4^n <= M * 2^n for some constant M and all n > n0? This is where you run into problems: 4^n / 2^n = 2^n grows without bound, so no constant M can keep M * 2^n above 4^n as n gets arbitrarily large. Thus, 4^n is not O(2^n).
See the comments for further explanation, but indeed, this is just an example I came up with to help you grasp the Big-O concept. It is not the actual algorithmic meaning.
Suppose you have an array, arr = [1, 2, 3, 4, 5].
An example of an O(1) operation is directly accessing an index, such as arr[0] or arr[2].
An example of an O(n) operation is a loop that iterates through the whole array, such as for elem in arr:.
n would be the size of your array. If your array is twice as big as the original array, n would also be twice as big. That's how variables work.
See the Big-O Cheat Sheet for complementary information.

Asymptotic complexity for typical expressions

The increasing order of following functions shown in the picture below in terms of asymptotic complexity is:
(A) f1(n); f4(n); f2(n); f3(n)
(B) f1(n); f2(n); f3(n); f4(n);
(C) f2(n); f1(n); f4(n); f3(n)
(D) f1(n); f2(n); f4(n); f3(n)
a) The time complexity order for this question was given as (n^0.99)*(logn) < n. How? log might be a slowly growing function, but it still grows faster than a constant.
b) Consider function f1: suppose it were f1(n) = (n^1.0001)(logn); then what would the answer be?
Whenever an expression multiplies a logarithmic factor by a polynomial one, does the logarithmic factor outweigh the polynomial one?
c) How does one check such cases? Suppose:
1) (n^2)logn vs (n^1.5): which has higher time complexity?
2) (n^1.5)logn vs (n^2): which has higher time complexity?
If we consider C_1 and C_2 such that C_1 < C_2, then we can say the following with certainty
(n^C_2)*log(n) grows faster than (n^C_1)
This is because
(n^C_1) grows slower than (n^C_2) (obviously)
also, for values of n larger than 2 (for log in base 2), log(n) is greater than 1.
in fact, log(n) is asymptotically greater than any constant C,
because log(n) -> inf as n -> inf
if both (n^C_2) is asymptotically greater than (n^C_1) AND log(n) is asymptotically greater
than 1, then we can certainly say that
(n^2)log(n) has greater complexity than (n^1.5)
We think of log(n) as a "slowly growing" function, but it still grows faster than 1, which is the key here.
coder101 asked an interesting question in the comments, essentially,
is n^e = Ω((n^c)*log_d(n))?
where e = c + ϵ for arbitrarily small ϵ
Let's do some algebra.
n^e = (n^c)*(n^ϵ)
so the question boils down to
is n^ϵ = Ω(log_d(n))
or is it the other way around, namely:
is log_d(n) = Ω(n^ϵ)
In order to do this, let us find the value of ϵ that satisfies n^ϵ > log_d(n).
n^ϵ > log_d(n)
ϵ*ln(n) > ln(log_d(n))
ϵ > ln(log_d(n)) / ln(n)
Because we know for a fact that
ln(n) * c > ln(ln(n)) (1)
as n -> infinity
We can say that, for an arbitrarily small ϵ, there exists an n large enough to
satisfy ϵ > ln(log_d(n)) / ln(n)
because, by (1), ln(log_d(n)) / ln(n) ---> 0 as n -> infinity.
With this knowledge, we can say that
n^ϵ = Ω(log_d(n))
for arbitrarily small ϵ
which means that
n^(c + ϵ) = Ω((n^c)*log_d(n))
for arbitrarily small ϵ.
in layperson's terms
n^1.1 > n * ln(n)
for all sufficiently large n
also
n ^ 1.001 > n * ln(n)
from some much, much bigger n onward
and even
n ^ 1.0000000000000001 > n * ln(n)
from some very, very big n onward.
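How big "sufficiently large" is can be estimated numerically. The Python sketch below works with u = ln(n), so that n^ϵ > ln(n) becomes ϵ*u > ln(u), and finds the largest root of ϵ*u = ln(u) by bisection. The function name `log_crossover` is hypothetical, and the natural log is assumed (d = e); another base only changes the constants.

```python
import math

def log_crossover(eps):
    """Return u* = ln(n*) such that n**eps > ln(n) for every n > n*.
    Returns 0.0 when the inequality already holds for all n > 1."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if eps > 1 / math.e:
        # eps*u - ln(u) has its minimum 1 + ln(eps) > 0, so it is never negative.
        return 0.0

    def g(u):
        return eps * u - math.log(u)

    lo = 1.0 / eps          # g is minimal here and g(lo) <= 0
    hi = 2.0 * lo
    while g(hi) <= 0:       # g increases for u > 1/eps, so this terminates
        hi *= 2.0
    for _ in range(200):    # bisection on the increasing branch
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi
```

The crossover point itself is n* = exp(u*), which is why working in u is convenient: for small ϵ the value of n* is far too large to represent as a float.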
Replacing f1 = (n^0.9999)(logn) by f1 = (n^1.0001)(logn) will yield answer (C): n, (n^1.0001)(logn), n^2, 1.00001^n
The reasoning is as follows:
. (n^1.0001)(logn) has higher complexity than n, obvious.
. n^2 higher than (n^1.0001)(logn) because the polynomial part asymptotically dominates the logarithmic part, so the higher-degree polynomial n^2 wins
. 1.00001^n dominates n^2 because the 1.00001^n has exponential growth, while n^2 has polynomial growth. Exponential growth asymptotically wins.
BTW, 1.00001^n has the form (1+Ɛ)^n with a very small Ɛ > 0. Still, however small Ɛ is, this is exponential growth, and it dominates any polynomial growth.
The crux of this problem lies in comparing f1(n) and f2(n).
For f(n) = n ^ c where 0 < c < 1, growth eventually becomes so slow that it is negligible compared with linear growth.
For f(n) = logc(n), where c > 1, growth likewise eventually becomes negligible compared with linear growth.
The product of these two functions is also eventually negligible compared with linear growth, because log(n) grows slower than n ^ (1 - c).
Hence, Theta(n ^ c * logc(n)) is asymptotically smaller than Theta(n).