When not to vectorize matlab? - optimization

I'm working on some matlab code which is processing large (but not huge) datasets: 10,000 784 element vectors (not sparse), and calculating information about that which is stored in a 10,000x10 sparse matrix. In order to get the code working I did some of the trickier parts iteratively, doing loops over the 10k items to process them, and a few a loop over the 10 items in the sparse matrix for cleanup.
My process initially took 73 iterations (so, on the order of 730k loops) to process, and ran in about 120 seconds. Not bad, but this is matlab, so I set out to vectorize it to speed it up.
In the end I have a fully vectorized solution which gets the same answer (so it's correct, or at least as correct as my initial solution), but takes 274 seconds to run, it's almost half as fast!
This is the first time I've ran into matlab code which runs slower vectorized than it does iteratively. Are there any rules of thumb or best practices for identifying when this is likely / possible?
I'd love to share the code for some feedback, but it's for a currently open school assignment so I really can't right now. If it ends up being one of those "Wow, that's weird, you probably did something wrong things" I'll probably revisit this in a week or two to see if my vectorization is somehow off.

Vectorisation in Matlab often means allocating a lot more memory (making a much larger array to avoid the loop eg by tony's trick). With improved JIT compiling of loops in recent versions - its possible that the memory allocation required for your vectorised solution means there is no advantage, but without seeing the code it's hard to say. Matlab has an excellent line-by-line profiler which should help you see which particular parts of the vectorised version are taking the time.

Have you tried plotting the execution time as a function of problem size (either the number of elements per vector [currently 784], or the number of vectors [currently 10,000])? I ran into a similar anomaly when vectorizing a Gram-Schmidt orthogonalization algorithm; it turned out that the vectorized version was faster until the problem grew to a certain size, at which point the iterative version actually ran faster, as seen in this plot:
Here are the two implementations and the benchmarking script:
clgs.m
function [Q,R] = clgs(A)
% QR factorization by unvectorized classical Gram-Schmidt orthogonalization
[m,n] = size(A);
R = zeros(n,n); % pre-allocate upper-triangular matrix
% iterate over columns
for j = 1:n
v = A(:,j);
% iterate over remaining columns
for i = 1:j-1
R(i,j) = A(:,i)' * A(:,j);
v = v - R(i,j) * A(:,i);
end
R(j,j) = norm(v);
A(:,j) = v / norm(v); % normalize
end
Q = A;
clgs2.m
function [Q,R] = clgs2(A)
% QR factorization by classical Gram-Schmidt orthogonalization with a
% vectorized inner loop
[m,n] = size(A);
R = zeros(n,n); % pre-allocate upper-triangular matrix
for k=1:n
R(1:k-1,k) = A(:,1:k-1)' * A(:,k);
A(:,k) = A(:,k) - A(:,1:k-1) * R(1:k-1,k);
R(k,k) = norm(A(:,k));
A(:,k) = A(:,k) / R(k,k);
end
Q = A;
benchgs.m
n = [300,350,400,450,500];
clgs_time=zeros(length(n),1);
clgs2_time=clgs_time;
for i = 1:length(n)
A = rand(n(i));
tic;
[Q,R] = clgs(A);
clgs_time(i) = toc;
tic;
[Q,R] = clgs2(A);
clgs2_time(i) = toc;
end
semilogy(n,clgs_time,'b',n,clgs2_time,'r')
xlabel 'n', ylabel 'Time [seconds]'
legend('unvectorized CGS','vectorized CGS')

To answer the question "When not to vectorize MATLAB code" more generally:
Don't vectorize code if the vectorization is not straight forward and makes the code very hard to read. This is under the assumption that
Other people than you might need to read and understand it.
The unvectorized code is fast enough for what you need.

This won't be a very specific answer, but I deal with extremely large datasets (4D cardiac datasets).
There are occasions where I need to perform an operation that involves a number of 4D sets. I can either create a loop, or a vectorised operation that essentially works on a concatenated 5D object. (e.g. as a trivial example, say you wanted to get the average 4D object, you could either create a loop collecting a walking-average, or concatenate in the 5th dimension, and use the mean function over it).
In my experience, putting aside the time it will take to create the 5D object in the first place, presumably due to the sheer size and memory access leaps involved when performing calculations, it is usually a lot faster to resort to a loop of the still large, but a lot more manageable 4D objects.
The other "microoptimisation" trick I will point out is that matlab is "column major order". Meaning, for my trivial example, I believe it would be faster to be averaging along the 1st dimension, rather than the 5th one, as the former involves contiguous locations in memory, whereas the latter involves huge jumps, so to speak. So it may be worth storing your megaarray in a dimension-order that has the data you'll be operating on as the first dimension, if that makes sense.
Trivial example to show the difference between operating on rows vs columns:
>> A = randn(10000,10000);
>> tic; for n = 1 : 100; sum(A,1); end; toc
Elapsed time is 12.354861 seconds.
>> tic; for n = 1 : 100; sum(A,2); end; toc
Elapsed time is 22.298909 seconds.

Related

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
for i2 in range(i1+1):
r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
if r > 0.95:
file.write("%6d %6d %.3f\n" % (i1,i2,r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day) -- presumably due to inefficient element-by-element writes to the outputfile. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
arr /= z.shape[0]
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
hf.create_dataset("correlation", data=arr)
From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory, assuming floats.
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - mean(z, 1)[:, None]
cov = dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the variances
sigma = std(z, 1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory mapped arrays (make sure it's opened for reading and writing) and the out parameter of dot (For another example see Optimizing my large data code with little RAM)
N = z.shape[0]
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N))
dot(z0, z0.T, out=arr)
arr /= sigma
arr /= sigma[:, None]
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
You can use scipy.spatial.distance.pdist with metric = correlation to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e10 terms that will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
You best bet is probably brute forcing blocks of data using scipy.spatial.distance.cdist(..., metric = correlation), and then keep only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture it should be much faster than doing one at a time.
please check out deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried on z.shape = (2500, 60) and pearsonr for 2500 * 2500. It has an extreme fast speed.
Not sure for 100000 x 100000 but worth trying.

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of the these algorithms, I can calculate the running time of my implementation, which is easy giving the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way you would experimentally gauge the running time in the single variable case is, insert a counter that increments when your data structure does a fundamental (putatively O(1)) operation, then take data for many different input sizes, and plot it on a log-log plot. That is, log T vs. log N. If the running time is of the form n^k you should see a straight line of slope k, or something approaching this. If the running time is like T(n) = n^{k log n} or something, then you should see a parabola. And if T is exponential in n you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However this will only really work if there's really only one leading term that dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 * m^3 + m^4.
When m = O(1), then T(n) = O(n^4).
When n = O(1), then T(n) = O(m^4).
When n = m, then T(n) = O(n^6).
In each of these regimes, "slices" along the plane of possible n,m values, a different one of the terms is the dominant term.
So there's no way to determine the function just from taking some points with fixed m, and some points with fixed n. If you did that, you wouldn't get the right answer for n = m -- you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables / complicated data structures, is with a pencil and piece of paper, and do traditional algorithmic analysis. Or possibly, a hybrid approach. Try to break the question of efficiency into different parts -- if you can split the question up into a sum or product of a few different functions, maybe some of them you can determine in the abstract, and some you can estimate experimentally.
Luckily two input parameters is still easy to visualize in a 3D scatter plot (3rd dimension is the measured running time), and you can check if it looks like a plane (in log-log-log scale) or if it is curved. Naturally random variations in measurements plays a role here as well.
In Matlab I typically calculate a least-squares solution to two-variable function like this (just concatenates different powers and combinations of x and y horizontally, .* is an element-wise product):
x = log(parameter_x);
y = log(parameter_y);
% Find a least-squares fit
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
Then this can be used to estimate running times for larger problem instances, ideally those would be confirmed experimentally to know that the fitted model works.
This approach works also for higher dimensions but gets tedious to generate, maybe there is a more general way to achieve that and this is just a work-around for my lack of knowledge.
I was going to write my own explanation but it wouldn't be any better than this.

Optimizing Python KD Tree Searches

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.
The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).
Here is my problem:
I have a list of 3million 2 dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.
With the KDtree, there is an option to do just this: KDtree.query_ball_tree() It generates a list of lists of all the points within X units from every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).
Potential solution #1: Is there a way to parse this list into a text file as it is writing?
Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?
Potential solution #3: Beats me, anyone else have any ideas?
From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:
cKDTree feature-complete
Cython version of KDTree, cKDTree, is now feature-complete. Most
operations (construction, query, query_ball_point, query_pairs,
count_neighbors and sparse_distance_matrix) are between 200 and 1000
times faster in cKDTree than in KDTree. With very minor caveats,
cKDTree has exactly the same interface as KDTree, and can be used as a
drop-in replacement.
Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).
You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:
BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
# write neighbours to a file...

Numpy: Reduce memory footprint of dot product with random data

I have a large numpy array that I am going to take a linear projection of using randomly generated values.
>>> input_array.shape
(50, 200000)
>>> random_array = np.random.normal(size=(200000, 300))
>>> output_array = np.dot(input_array, random_array)
Unfortunately, random_array takes up a lot of memory, and my machine starts swapping. It seems to me that I don't actually need all of random_array around at once; in theory, I ought to be able to generate it lazily during the dot product calculation...but I can't figure out how.
How can I reduce the memory footprint of the calculation of output_array from input_array?
This obviously isn't the fastest solution, but have you tried:
m, inner = input_array.shape
n = 300
out = np.empty((m, n))
for i in xrange(n):
out[:, i] = np.dot(input_array, np.random.normal(size=inner))
This might be a situation where using cython could reduce your memory usage. You could generate the random numbers on the fly and accumulate the result as you go. I don't have the time to write and test the full function, but you would definitely want to use randomkit (the library that numpy uses under the hood) at the c-level.
You can take a look at some example code I wrote for another application to see how to wrap randomkit:
https://github.com/synapticarbors/pylangevin-integrator/blob/master/cIntegrator.pyx
And also check out how matrix multiplication is implemented in the following paper on cython:
http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf
Instead of having both arrays as inputs, just have input_array as one, and then in the method, generate small chunks of the random array as you go.
Sorry if it is just a sketch instead of actual code, but hopefully it is enough to get you started.

Optimize MATLAB code (nested for loop to compute similarity matrix)

I am computing a similarity matrix based on Euclidean distance in MATLAB. My code is as follows:
for i=1:N % M,N is the size of the matrix x for whose elements I am computing similarity matrix
for j=1:N
D(i,j) = sqrt(sum(x(:,i)-x(:,j)).^2)); % D is the similarity matrix
end
end
Can any help with optimizing this = reducing the for loops as my matrix x is of dimension 256x30000.
Thanks a lot!
--Aditya
The function to do so in matlab is called pdist. Unfortunately it is painfully slow and doesnt take Matlabs vectorization abilities into account.
The following is code I wrote for a project. Let me know what kind of speed up you get.
Qx=repmat(dot(x,x,2),1,size(x,1));
D=sqrt(Qx+Qx'-2*x*x');
Note though that this will only work if your data points are in the rows and your dimensions the columns. So for example lets say I have 256 data points and 100000 dimensions then on my mac using x=rand(256,100000) and the above code produces a 256x256 matrix in about half a second.
There's probably a better way to do it, but the first thing I noticed was that you could cut the runtime in half by exploiting the symmetry D(i,j)==D(i,j)
You can also use the function norm(x(:,i)-x(:,j),2)
I think this is what you're looking for.
D=zeros(N);
jIndx=repmat(1:N,N,1);iIndx=jIndx'; %'# fix SO's syntax highlighting
D(:)=sqrt(sum((x(iIndx(:),:)-x(jIndx(:),:)).^2,2));
Here, I have assumed that the distance vector, x is initalized as an NxM array, where M is the number of dimensions of the system and N is the number of points. So if your ordering is different, you'll have to make changes accordingly.
To start with, you are computing twice as much as you need to here, because D will be symmetric. You don't need to calculate the (i,j) entry and the (j,i) entry separately. Change your inner loop to for j=1:i, and add in the body of that loop D(j,i)=D(i,j);
After that, there's really not much redundancy left in what that code does, so your only room left for improvement is to parallelize it: if you have the Parallel Computing Toolbox, convert your outer loop to a parfor and before you run it, say matlabpool(n), where n is the number of threads to use.