Can I somehow optimize this formula? I evaluate it many times and it takes a lot of time...
w - 1xN double
phis - NxN double
x - Nx2 double
sum(w(ones([size(x, 1) 1]),:) .* phis, 2)
You're taking the scalar product of each row of phis with w. You can do this easily using linear algebra.
out = phis * w';
This matrix multiplication saves you calls to sum, ones, and size, which should make your code a lot faster. Furthermore, linear algebra operations are often very fast in MATLAB, since that is what the language has historically been optimized for.
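For readers more at home in NumPy, the same identity checked on random data (a small sketch; N = 5 and the arrays are made up):

import numpy as np

N = 5
w = np.random.randn(1, N)        # 1xN, like w
phis = np.random.randn(N, N)     # NxN, like phis

elementwise = np.sum(w * phis, axis=1)    # the original formula: row-wise dot products with w
matmul = phis @ w.ravel()                 # the linear-algebra version, phis * w'

print(np.allclose(elementwise, matmul))   # True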
Let f: R -> R be an infinitely differentiable function. What is the computational complexity of calculating the first n derivatives of f in JAX? The naive chain rule would suggest that each multiplication gives a factor-of-2 increase, hence the nth derivative would require at least 2^n more operations. I imagine though that clever manipulation of formal series would reduce the number of required calculations and eliminate duplications, especially if the derivatives are JAX jitted. Is there a difference between the JAX, TensorFlow and Torch implementations?
https://openreview.net/forum?id=SkxEF3FNPH discusses this topic, but doesn't provide a computational complexity.
What is the computational complexity of calculating the first n derivatives of f in JAX?
There's not much you can say in general about computational complexity of Nth derivatives. For example, with a function like jnp.sin, the Nth derivative is O[1], oscillating between negative and positive sin and cos calls as N grows. For an order-k polynomial, the Nth derivative is O[0] for N > k. Other functions may have complexity that is linear or polynomial or even exponential with N depending on the operations they contain.
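To make the first point concrete, a minimal sketch (the function and the order 4 are arbitrary choices): repeatedly applying jax.grad to jnp.sin just cycles through cos, -sin, -cos and back to sin, so the cost does not blow up with N.

import jax
import jax.numpy as jnp

def nth_derivative(f, n):
    # Compose reverse-mode AD n times; jit the result so XLA can simplify
    # the composed expression.
    for _ in range(n):
        f = jax.grad(f)
    return jax.jit(f)

d4_sin = nth_derivative(jnp.sin, 4)
print(d4_sin(1.0), jnp.sin(1.0))   # the 4th derivative of sin is sin again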
I imagine though that clever manipulation of formal series would reduce the number of required calculations and eliminate duplications, especially if the derivatives are JAX jitted
You imagine correctly! One implementation of this idea is the jax.experimental.jet module, which is an experimental transform designed for computing higher-order derivatives efficiently and accurately. It doesn't cover all JAX functions, but it may be complete enough to do what you have in mind.
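A minimal sketch of what using jet looks like (assuming jnp.sin at x = 1.0 as the test function; the seed (1.0, 0.0, 0.0) perturbs the input along x(t) = x + t):

import jax.numpy as jnp
from jax.experimental import jet

x = 1.0
# Taylor mode: propagate a truncated Taylor series through the function in
# one pass, instead of composing jax.grad n times.
primal_out, series_out = jet.jet(jnp.sin, (x,), ((1.0, 0.0, 0.0),))

print(primal_out)   # sin(x)
print(series_out)   # higher-order Taylor coefficients of sin(x + t) at t = 0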
I've come across a from-scratch implementation of Gaussian processes:
http://krasserm.github.io/2018/03/19/gaussian-processes/
There, the isotropic squared exponential kernel is implemented in NumPy. It looks like:

kappa(x_i, x_j) = sigma_f^2 * exp(-||x_i - x_j||^2 / (2 * l^2))

The implementation is:
import numpy as np

def kernel(X1, X2, l=1.0, sigma_f=1.0):
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)
This is consistent with the implementation of Nando de Freitas: https://www.cs.ubc.ca/~nando/540-2013/lectures/gp.py
However, I am not quite sure how this implementation matches the provided formula, especially in the sqdist part. In my opinion, it is wrong, but it works (and delivers the same results as scipy's cdist with squared Euclidean distance). Why do I think it is wrong? If you multiply out the product of the two factors, you get

x_i * x_i - 2 * x_i * x_j + x_j * x_j,

which equals either a scalar or an nxn matrix for a vector x_i, depending on whether you define x_i to be a column vector or not. The implementation, however, gives back an nx1 vector with the squared values.
I hope that anyone can shed light on this.
I found out: the implementation is correct. I just was not aware of the (in my opinion fuzzy) notation that is sometimes used in ML contexts. What is to be achieved is a distance matrix: each row vector of matrix A is to be compared with each row vector of matrix B to infer the covariance matrix, not (as I somehow guessed) the direct distance between two matrices/vectors.
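For anyone who wants to convince themselves, a small check (with made-up shapes) that the vectorized sqdist expression reproduces the row-pairwise squared Euclidean distances from scipy's cdist:

import numpy as np
from scipy.spatial.distance import cdist

X1 = np.random.randn(5, 3)     # 5 points in 3 dimensions (made-up shapes)
X2 = np.random.randn(7, 3)     # 7 points in 3 dimensions

# The vectorized expression from the kernel above
sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)

print(np.allclose(sqdist, cdist(X1, X2, 'sqeuclidean')))   # True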
I have an MxN matrix and a length-N vector. What I want to do is subtract this vector from each of the M rows of the matrix. The obvious way to do this is to tf.tile the vector, but this seems highly inefficient because of all the new memory allocation: tiling takes up to 4x as much time as the actual subtraction.
Is there any way to do it more efficiently, or will I have to write my own operation in C++?
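For what it's worth, a minimal sketch (assuming current TensorFlow's eager API and made-up shapes): arithmetic ops broadcast NumPy-style, so the length-N vector can be subtracted from the MxN matrix directly, without materializing the tiled copy.

import tensorflow as tf

M, N = 4, 3                                    # made-up sizes
matrix = tf.random.uniform((M, N))
vector = tf.random.uniform((N,))

# Broadcasting treats the length-N vector as a 1xN row and subtracts it
# from every row of the MxN matrix; no tf.tile is needed.
diff = matrix - vector

# Same result as the explicitly tiled version:
tiled = matrix - tf.tile(tf.expand_dims(vector, 0), [M, 1])
print(bool(tf.reduce_all(tf.equal(diff, tiled))))   # True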
I am computing a similarity matrix based on Euclidean distance in MATLAB. My code is as follows:
for i=1:N   % M,N is the size of the matrix x whose columns I am comparing
  for j=1:N
    D(i,j) = sqrt(sum((x(:,i)-x(:,j)).^2));   % D is the similarity matrix
  end
end
Can anyone help with optimizing this, i.e. reducing the for loops? My matrix x is of dimension 256x30000.
Thanks a lot!
--Aditya
The function to do this in MATLAB is called pdist. Unfortunately it is painfully slow and doesn't take MATLAB's vectorization abilities into account.
The following is code I wrote for a project. Let me know what kind of speed up you get.
Qx=repmat(dot(x,x,2),1,size(x,1));
D=sqrt(Qx+Qx'-2*x*x');
Note, though, that this will only work if your data points are in the rows and your dimensions in the columns. For example, say I have 256 data points and 100000 dimensions; then on my Mac, using x=rand(256,100000), the above code produces a 256x256 matrix in about half a second.
There's probably a better way to do it, but the first thing I noticed was that you could cut the runtime in half by exploiting the symmetry D(i,j)==D(j,i).
You can also use the function norm(x(:,i)-x(:,j),2)
I think this is what you're looking for.
D=zeros(N);
jIndx=repmat(1:N,N,1); iIndx=jIndx';
D(:)=sqrt(sum((x(iIndx(:),:)-x(jIndx(:),:)).^2,2));
Here, I have assumed that the data array x is initialized as an NxM array, where M is the number of dimensions of the system and N is the number of points. So if your ordering is different, you'll have to make changes accordingly.
To start with, you are computing twice as much as you need to here, because D will be symmetric. You don't need to calculate the (i,j) entry and the (j,i) entry separately. Change your inner loop to for j=1:i, and add in the body of that loop D(j,i)=D(i,j);
After that, there's really not much redundancy left in what that code does, so your only room left for improvement is to parallelize it: if you have the Parallel Computing Toolbox, convert your outer loop to a parfor and before you run it, say matlabpool(n), where n is the number of threads to use.
Greetings. I'm trying to approximate the function
Log10[x^k0 + k1], where .21 < k0 < 21, 0 < k1 < ~2000, and x is integer < 2^14.
k0 & k1 are constant. For practical purposes, you can assume k0 = 2.12, k1 = 2660. The desired accuracy is 5*10^-4 relative error.
This function is virtually identical to Log[x], except near 0, where it differs a lot.
I have already come up with a SIMD implementation that is ~1.15x faster than a simple lookup table, but I would like to improve it if possible, which I think is very hard due to the lack of efficient instructions.
My SIMD implementation uses 16bit fixed point arithmetic to evaluate a 3rd degree polynomial (I use least squares fit). The polynomial uses different coefficients for different input ranges. There are 8 ranges, and range i spans (64)2^i to (64)2^(i + 1).
The rationale behind this is that the derivatives of Log[x] drop rapidly with x, meaning a polynomial will fit it more accurately, since polynomials are an exact fit for functions whose derivatives are 0 beyond a certain order.
SIMD table lookups are done very efficiently with a single _mm_shuffle_epi8(). I use SSE's float to int conversion to get the exponent and significand used for the fixed point approximation. I also software pipelined the loop to get ~1.25x speedup, so further code optimizations are probably unlikely.
What I'm asking is if there's a more efficient approximation at a higher level?
For example:
Can this function be decomposed into functions with a limited domain like
log2((2^x) * significand) = x + log2(significand)
hence eliminating the need to deal with different ranges (table lookups). The main problem, I think, is that adding the k1 term kills all those nice log properties that we know and love, making it not possible. Or is it? (A numeric check of this decomposition follows the list below.)
Iterative method? I don't think so, because the Newton iteration for log[x] is already a complicated expression.
Exploiting locality of neighboring pixels? - if the range of the 8 inputs falls in the same approximation range, then I can look up a single coefficient, instead of looking up separate coefficients for each element. Thus, I can use this as a fast common case, and use a slower, general code path when it isn't. But for my data, the range needs to be ~2000 before this property holds 70% of the time, which doesn't seem to make this method competitive.
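Following up on the decomposition idea in the first bullet, a minimal numeric check (assuming the example constants k0 = 2.12, k1 = 2660 from the question): the identity Log10[x^k0 + k1] = k0*Log10[x] + Log10[1 + k1/x^k0] does hold, but the correction term still depends on x, so the exponent trick alone doesn't remove the need for a per-range table - it only shrinks the domain that has to be approximated.

import numpy as np

k0, k1 = 2.12, 2660.0                       # example constants from the question
x = np.array([100.0, 1000.0, 10000.0])      # arbitrary test points

lhs = np.log10(x**k0 + k1)
rhs = k0 * np.log10(x) + np.log1p(k1 / x**k0) / np.log(10)   # log10(1 + k1/x^k0)

print(np.allclose(lhs, rhs))                # True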
Please, give me some opinion, especially if you're an applied mathematician, even if you say it can't be done. Thanks.
You should be able to improve on least-squares fitting by using Chebyshev approximation. (The idea is, you're looking for the approximation whose worst-case deviation in a range is least; least-squares instead looks for the one whose summed squared difference is least.) I would guess this doesn't make a huge difference for your problem, but I'm not sure -- hopefully it could reduce the number of ranges you need to split into, somewhat.
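A sketch of what that could look like, assuming the example constants from the question, degree 3, and one of the eight input ranges; interpolating at Chebyshev nodes is a standard near-minimax stand-in for the true minimax (Chebyshev) fit:

import numpy as np
from numpy.polynomial import chebyshev as C

k0, k1 = 2.12, 2660.0                 # example constants from the question
f = lambda x: np.log10(x**k0 + k1)

lo, hi = 64 * 2**3, 64 * 2**4         # one of the eight input ranges

t = C.chebpts1(4)                     # 4 Chebyshev nodes in [-1, 1] for a cubic
nodes = 0.5 * (hi + lo) + 0.5 * (hi - lo) * t
cfs = C.chebfit(t, f(nodes), 3)       # cubic (in the scaled variable t) through those nodes

xs = np.linspace(lo, hi, 100000)
ts = (2 * xs - (hi + lo)) / (hi - lo)
rel_err = np.abs(C.chebval(ts, cfs) - f(xs)) / np.abs(f(xs))
print(rel_err.max())                  # worst-case relative error over the range

Comparing this worst-case error against that of an equispaced least-squares cubic over the same range shows whether the switch is worth it for these constants.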
If there's already a fast implementation of log(x), maybe compute P(x) * log(x) where P(x) is a polynomial chosen by Chebyshev approximation. (Instead of trying to do the whole function as a polynomial approx -- to need less range-reduction.)
I'm an amateur here -- just dipping my toe in as there aren't a lot of answers already.
One observation:
You can find an expression for how large x needs to be as a function of k0 and k1, such that the term x^k0 dominates k1 enough for the approximation:
x^k0 + k1 ~= x^k0, allowing you to approximately evaluate the function as k0*Log10[x].
This would take care of all x's above some value.
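As a rough numeric check of where that crossover sits - a sketch assuming the example constants from the question (k0 = 2.12, k1 = 2660) and the stated 5*10^-4 relative-error budget:

import numpy as np

k0, k1 = 2.12, 2660.0           # example constants from the question
tol = 5e-4                      # target relative error

x = np.arange(1, 2**14)
exact = np.log10(x**k0 + k1)
approx = k0 * np.log10(x)       # drop the k1 term entirely

rel_err = np.abs(approx - exact) / np.abs(exact)
ok = np.nonzero(rel_err <= tol)[0]
print(x[ok[0]])                 # first x from which k0*Log10[x] alone is within tolerance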
I recently read how the sRGB model compresses physical tristimulus values into stored RGB values.
It is basically very similar to the function I am trying to approximate, except that it is defined piecewise:
k0 x, x < 0.0031308
k1 x^0.417 - k2 otherwise
I was told the constant addition in Log[x^k0 + k1] was to make the beginning of the function more linear. But that can easily be achieved with a piecewise approximation, which would make the approximation a lot more "uniform" - with only 2 approximation ranges. This should be cheaper to compute because there is no longer any need to compute an approximation range index (integer log) and do a SIMD coefficient lookup.
For now, I conclude this will be the best approach, even though it doesn't approximate the function precisely. The hard part will be proposing this change and convincing people to use it.