Vector functions of STREAM benchmark - testing

I am currently doing a small research project for school, in which I am to compare the memory bandwidth of a hypervisor with that of the virtual machines it creates and manages.
Due to the timeframe of the project, only one of the vector functions tested by STREAM will be analysed. My thought process is to look at the results from the "Copy" function, since it is the most basic function and performs no arithmetic, as stated at the bottom of
After all, this is a memory bandwidth performance test.
I have yet to find any post via Google that proves or disproves my theory. Can anyone here shed some light on this topic?

STREAM Copy and the other three tests are usually written in plain C without explicit vectorization, but the loops are simple and most compilers can optimize them into a vectorized variant. The kernel column in the table below is the full code of each loop. There are three arrays, a, b and c, of the same size, preinitialized with some floating-point data. Each vector element is a double (typically 8 bytes).
The table below shows how many bytes and FLOPs are counted in each iteration of the STREAM loops.
The test consists of multiple repetitions of the four kernels, and the best result of (typically) 10 trials is chosen.
name    kernel                  bytes/iter  FLOPS/iter
COPY:   a(i) = b(i)             16          0
SCALE:  a(i) = q*b(i)           16          1
SUM:    a(i) = b(i) + c(i)      24          1
TRIAD:  a(i) = b(i) + q*c(i)    24          2
More recent variants of the test are NERSC: and HPCC: both based on
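As a rough illustration of how the bytes/iter column turns a timing into a bandwidth figure, here is a minimal Python sketch. The function name and the sample numbers are made up for illustration; only the bytes-per-iteration values come from the table above.

```python
# Bytes moved per loop iteration, taken from the table above (8-byte doubles).
BYTES_PER_ITER = {"COPY": 16, "SCALE": 16, "SUM": 24, "TRIAD": 24}


def stream_bandwidth_gb_s(kernel, n_elements, best_seconds):
    """Return the bandwidth in GB/s (10^9 bytes/s) for one pass over the arrays."""
    total_bytes = BYTES_PER_ITER[kernel] * n_elements
    return total_bytes / best_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: 10 million elements, best Copy trial took 16 ms.
    print(stream_bandwidth_gb_s("COPY", 10_000_000, 0.016))
```

Because Copy counts 16 bytes and 0 FLOPs per iteration, its figure depends on memory traffic alone, which is the property the question is relying on.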


Efficiently implementing DXT1 texture decompression in hardware

DXT1 compression is designed to be fast to decompress in hardware, where it's used in texture samplers. The Wikipedia article says that under certain circumstances you can work out the coefficients of the interpolated colours as:
c2 = (2/3)*c0+(1/3)*c1
or rearranging that:
c2 = (1/3)*(2*c0+c1)
However you rearrange the above equation, you always end up having to multiply something by 1/3 (or divide by 3, which is the same deal but even more expensive). It seems weird to me that a texture format which is designed to be fast to decompress in hardware would require a multiplication or division. The FPGA I'm implementing my GPU on has only limited resources for multiplications, and I want to save those for where they're really required.
So am I missing something? Is there an efficient way of avoiding the multiplications of the colour channels by a 1/3? Or should I just eat the cost of that multiplication?
This might be a bad way of imagining it, but could you implement it by adding and subtracting successive halves (shifts)?
As you have 16 bits, this lets you get quite accurate with successive additions and subtractions.
A third could be represented as
a(n+1) = a(n) +/- A>>1, where the list [0, 0, 1, 0, 1, etc] shows whether to add or subtract the shifted result.
I believe this is called fractional maths.
However, in FPGAs, it is difficult to know whether this is actually more power efficient than the native DSP blocks (e.g. DSP48E1) provided.
The best answer I can come up with is to use the identity:
x/3 = sum(n=1 to infinity) (x/2^(2n))
and then take the first n terms. Using 4 terms I get:
x/4 + x/16 + x/64 + x/256
which equals
(85/256)*x = 0.33203125*x
which is probably good enough.
This relies on multiplication by a fixed power of 2 being free in hardware, then 3 additions of which I can run 2 in parallel.
Any better answer is appreciated though.
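To make the truncated series concrete, here is a small Python sketch (not hardware code) that computes the exact coefficient of the first few terms and a shift-and-add approximation of x/3. The function names are invented for this example, and the shifts truncate, as a plain right shift would.

```python
from fractions import Fraction


def series_coefficient(terms):
    """Exact value of 1/4 + 1/16 + ... with the given number of terms."""
    return sum(Fraction(1, 4 ** n) for n in range(1, terms + 1))


def third_shift_add(x, terms=4):
    """Approximate x/3 for a non-negative integer x using shifts and adds only."""
    return sum(x >> (2 * n) for n in range(1, terms + 1))


if __name__ == "__main__":
    print(series_coefficient(4), float(series_coefficient(4)))
    print(third_shift_add(189), 189 / 3)
```

Each shift discards its fractional bits, so the integer result sits below the real-valued sum: for x = 189 the shifts give 47 + 11 + 2 + 0 = 60 against an exact 63.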
** EDIT **: Using a combination of this and #dyslexicgruffalo's answer, I wrote a simple C++ program which iterated over the various sequences, tried them all, and recorded the average and maximum errors.
I did this for 0 <= x <= 189 (as 189 is the value of 2*c0.g + c1.g when g, which is 6 bits, maxes out).
The shortest good sequence (max error of 2, average error of 0.62, 4 ops) was:
1 + x/4 + x/16 + x/64.
The best sequence, which had a max error of 1 and an average error of 0.32 but takes 6 ops, was:
x/2 - x/4 + x/8 - x/16 + x/32 - x/64.
For the 5 bit values (red and blue) the maximum value is 31*3, and the above sequences are still good but not the best. The best ones for that range are:
x/4 + x/8 - x/16 + x/32 [max error of 1, average 0.38]
1 + x/4 + x/16 [max error of 2, average of 0.68]
(And, luckily, none of the above sequences ever guesses an answer which is too big so no clamping is needed even though they're not perfect)
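The kind of search described in the edit can be sketched in a few lines of Python. This is not the original C++ program: the sequence encoding, the names, and the error metric (absolute difference from the real-valued x/3) are all assumptions, so the printed numbers need not match the figures quoted above.

```python
def approximate(x, constant, terms):
    """Evaluate constant + sum(sign * (x >> shift)) for one input value."""
    return constant + sum(sign * (x >> shift) for sign, shift in terms)


def evaluate(constant, terms, max_x):
    """Return (max_error, mean_error, max_output) over 0..max_x versus x/3."""
    errors = []
    max_output = 0
    for x in range(max_x + 1):
        approx = approximate(x, constant, terms)
        errors.append(abs(approx - x / 3))
        max_output = max(max_output, approx)
    return max(errors), sum(errors) / len(errors), max_output


# 1 + x/4 + x/16 + x/64
SHORT_SEQ = (1, [(+1, 2), (+1, 4), (+1, 6)])
# x/2 - x/4 + x/8 - x/16 + x/32 - x/64
ALT_SEQ = (0, [(+1, 1), (-1, 2), (+1, 3), (-1, 4), (+1, 5), (-1, 6)])

if __name__ == "__main__":
    for name, seq in (("short", SHORT_SEQ), ("alternating", ALT_SEQ)):
        print(name, evaluate(*seq, 189))
```

The third value returned is the largest output produced, which is what decides whether a 6-bit result (at most 63) would ever need clamping.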

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr without casting into doubles, since I really only need 32 bit floats. I also ditched the conj(), since my data are always real.
cov = np.dot(A, A.T) / n # where A is an array of 32 bit floats
d = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(d, d))
I still run out of memory and I'm using a large memory machine, 264GB
I've been told, that the fast C libraries, are probably using a routine which breaks the
dot product up into pieces, and to optimize this, the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in a reasonable amount of time "manually", with Python loops.
Does anybody know of a way to ask numpy for a decent dot product routine, that balances memory usage with speed...?
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
Not sure that I follow, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
EDIT: I apologize because my array is really much bigger than I thought.
array size is actually 151,000 x 265
I'm running out of memory on a machine with 264 GB with at least 230 GB free.
I'm surprised that the numpy call to blas dgemm and being careful with C order arrays
didn't do squat.
Python compiled with Intel's MKL will run this with 12GB of memory in about 30 seconds:
>>> import numpy as np
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> np.dot(A, A.T)
array([[ 86.54410553,  64.25226593,  67.24698639, ...,  68.5118103 ,
         64.57299805,  66.69223785],
       ...,
       [ 66.69223785,  62.01016235,  67.35866547, ...,  66.66306305,
         65.75863647,  86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Python Anaconda and install the accelerate package, which has a 30-day trial version (free for academics) and contains an MKL build. Various other C++ BLAS libraries should also work; even if one copies the array from C to F order, it should not take more than ~30GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
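The sizes quoted above are plain arithmetic, and the same arithmetic can be applied to the 151,000-row case from the question. The helper below is a hypothetical name and uses GB = 10^9 bytes; it only counts dense storage and says nothing about what any particular BLAS build allocates internally.

```python
def array_gb(shape, itemsize):
    """Size of a dense array in GB (10^9 bytes) for the given element size."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize / 1e9


if __name__ == "__main__":
    print(array_gb((50000, 50000), 4))        # float32 result for 50,000 rows
    print(array_gb((50000, 50000, 265), 4))   # the pathological 3-D intermediate
    print(array_gb((151000, 151000), 4))      # float32 result for 151,000 rows
    print(array_gb((151000, 151000), 8))      # the same result in float64
```

For 151,000 rows the float64 result alone comes to about 182 GB, so on a machine with roughly 230 GB free there is little room left for any temporary copy, whereas the float32 result is about half that.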
If it's a gemm issue you can try a chunked gemm:
import numpy as np

def chunk_gemm(A, B, csize):
    # Compute A.dot(B) one csize x csize output tile at a time.
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in range(0, A.shape[0], csize):
        iend = i + csize
        for j in range(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order, not sure if that's the case here. einsum will buffer its inputs, and use vectorized SIMD operations where possible, but outside that it is basically going to run the naive three nested loops to compute the matrix product.
UPDATE: It turns out the dot product completed without error, but upon careful inspection
the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] = 0 for all rows...
This is super annoying...
Another Blas description
The exchange mentions something that I thought about too... Since BLAS is Fortran, shouldn't
the order of the input be F order...? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault. So I'm back to square one.
I finally tracked down my problem, which was in the details as usual.
I'm using an array of np.float32 which were stored as F order. I can't control the F order to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas
roi = np.ascontiguousarray(roi)  # see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
The array was stored in F order, which is why it is converted with np.ascontiguousarray above.
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and the 's' for 'single'.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing:
A \dot A.T
Thanks everyone for your thoughts...knowing that it could be done was most important.

Multiply one fixed matrix by a huge number of vectors

I'll need to change the basis of some 10^7 vectors, each having 200 coordinates. So I will multiply one [200 x 200] matrix by 10^7 [200 x 1] vectors. I need it to run very fast, but I also need to code it fast (one day or less), and my CUDA is poor, so I don't want to code it from scratch in CUDA or OpenCL. Maybe some existing library can do it for me? Notice that, if the solution uses GPGPU, the matrix should be transferred to the GPU only once, otherwise the performance will be poor. Could I use OpenACC (or OpenMP, I don't know)? Is it possible to do this in a day?
I prefer open source solutions (for both convenience and ethical reasons) but I can tolerate a closed source solution, even paid (assuming it is not too costly).
This is for my dissertation.
Thank you for your attention.
You can put your vectors in a matrix. 200 * 10^7 values is perhaps too much to hold at once, depending on your system, so you can split it.
Then you can use any code that is optimized for matrix-matrix multiplication, like BLAS. There are many implementations for CPUs, GPUs (cuBLAS, MAGMA,...), multicores (PLASMA,...), or distributed memory.
Since you will have big matrices, you will get a better speed-up than by doing matrix-vector multiplications.
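To put a number on "too much at once", here is a small Python sketch of how the split could be sized. The function name, the 8-byte element size (double precision) and the 1 GB budget are assumptions for illustration; the question does not state a precision or a memory limit.

```python
import math


def plan_chunks(n_vectors, dim, itemsize, budget_bytes):
    """Return (vectors_per_chunk, n_chunks) so one input chunk fits the budget."""
    bytes_per_vector = dim * itemsize
    vectors_per_chunk = max(1, budget_bytes // bytes_per_vector)
    n_chunks = math.ceil(n_vectors / vectors_per_chunk)
    return vectors_per_chunk, n_chunks


if __name__ == "__main__":
    # 10^7 vectors of 200 doubles is 16 GB in total; assume 1 GB per chunk.
    print(plan_chunks(10**7, 200, 8, 10**9))
```

Each chunk then becomes one 200 x k matrix that is multiplied by the fixed 200 x 200 matrix in a single matrix-matrix call, with the output chunk needing the same amount of space again.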
You're going to multiply 10 million big vectors by a huge matrix that is the same for all of them.
It would be fastest if all possible decision-making could be compiled-out ahead of time.
In other words, there are lots of index calculations and loop testing that would be identically repeated millions of times.
This sounds like a perfect case for pre-compilation:
Write a small program that would take as input your 200x200 matrix data values, and have it print out a piece of program text defining a function capable of inputting the input vector and outputting the result vector.
It could look something like this:
void multTheMatrixByTheVector(double a[200], double b[200]){
    b[0] = 0
        + a[0] * <a constant, the value of mat[0][0]>
        + a[1] * <a constant, the value of mat[1][0]>
        ...
        + a[199] * <a constant, the value of mat[199][0]>
        ;
    b[1] = 0
        + a[0] * <a constant, the value of mat[0][1]>
        + a[1] * <a constant, the value of mat[1][1]>
        ...
        + a[199] * <a constant, the value of mat[199][1]>
        ;
    ...
    b[199] = etc. etc.
}
You see, that function will be around 40000 lines long, but a decent compiler should be able to handle it.
Of course, if any of the matrix elements are zero, i.e. there's some sparsity, you can omit those lines (or let the compiler optimizer do it).
To do this on CUDA or vectorized instructions, you'd have to modify it accordingly, but that should be do-able.
When you include that function in your main program, it should be able to run about as fast as the machine can go.
It's not wasting any cycles doing index calculations, loop testing, or multiplying by empty matrix cells.
Then, if it takes 10ns per multiply-and-add, my back-of-the-envelope estimate is 200 x 200 x 10ns = 400 usec per vector, or 4000 seconds for all 10^7 vectors: a little over an hour.
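The "small program" that prints the function text could look roughly like the Python sketch below. It is one possible generator, not the answer author's code: the generator's name is invented, it follows the index convention of the sketch above (b[j] sums a[i] * mat[i][j]), it writes finite Python floats as C literals via repr, and it skips zero entries as suggested.

```python
def generate_c_function(mat, name="multTheMatrixByTheVector"):
    """Return C source computing b from a with every matrix entry as a literal."""
    n = len(mat)
    lines = [f"void {name}(const double a[{n}], double b[{n}]) {{"]
    for col in range(n):
        lines.append(f"    b[{col}] = 0")
        for row in range(n):
            value = mat[row][col]
            if value != 0:  # omit zero entries instead of multiplying by them
                lines.append(f"        + a[{row}] * {value!r}")
        lines.append("        ;")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Tiny 2x2 example; the real input would be the 200x200 matrix.
    print(generate_c_function([[1.5, 0.0], [2.0, 3.0]], name="mult2"))
```

The printed text is then compiled together with the main program. Whether it actually beats a tuned BLAS call would have to be measured.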

How to optimize OpenCL code for neighbors accessing?

Edit: Proposed solutions results are added at the end of the question.
I'm starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each element holds a bunch of information (around 200 bytes). Every step, every element accesses its neighbors' information and accumulates it to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create a one-dimensional OpenCL buffer and fill it with structs representing the elements, each of which has an "int neighbors[6]" where I store the indices of its neighbors in the buffer. I launch a kernel that consults the neighbors and accumulates their information into element variables that are not consulted in this step, and then I launch another kernel that uses these variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct{
    float4 var1;
    float4 var2;
    float4 nextStepVar1;
    int neighbors[8];
    int var3;
    int nextStepVar2;
    bool var4;
} Element;

__kernel void step1(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    for (int i=0; i < 6; ++i){
        if (elem.neighbors[i] != -1){
            //Gather information of the neighbor and accumulate it in elem.nextStepVars
        }
    }
    elements[id] = elem;
}

__kernel void step2(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    //update elem variables by using elem.nextStepVariables
    //restart elem.nextStepVariables
}
Right now, my OpenCL implementation takes basically the same time as my C++ implementation.
So, the question is: How would you (the experts :P) address this problem?
I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Edit: Third and fifth optimizations proposed by Huseyin Tugrul were already in the code. As mentioned here, to make structs behave properly, they need to satisfy some restrictions, so it is worth understanding that to avoid headaches.
Edit 1: Applying the seventh optimization proposed by Huseyin Tugrul performance increased from 7 fps to 60 fps. In a more general experimentation, the performance gain was about x8.
Edit 2: Applying the first optimization proposed by Huseyin Tugrul increased performance by about x1.2. I think the real gain is higher, but it is hidden by another bottleneck that is not yet solved.
Edit 3: Applying the 8th and 9th optimizations proposed by Huseyin Tugrul didn't change performance, because of the lack of significant code taking advantage of these optimizations, worth trying in other kernels though.
Edit 4: Passing invariant arguments (such as n_elements or workgroupsize) to the kernels as #defines instead of kernel args, as mentioned here, increased performance by around x1.33. As explained in the document, this is because of the aggressive optimizations the compiler can do when it knows the values at compile time.
Edit 5: Applying the second optimization proposed by Huseyin Tugrul, but using 1 bit per neighbor and bitwise operations to check whether a neighbor is present (so, if (neighbors & 1) != 0, the top neighbor is present; if (neighbors & 2) != 0, the bottom neighbor is present; if (neighbors & 4) != 0, the right neighbor is present, etc.), increased performance by a factor of x1.11. I think this was mostly because of the reduced data transfer, because data movement was, and still is, my bottleneck. Soon I will try to get rid of the dummy variables used to add padding to my structs.
Edit 6: By eliminating the structs that I was using, and creating separated buffers for each property, I eliminated the padding variables, saving space, and was able to optimize the global memory access and local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing this, despite the programmatic complexity and unreadability.
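The one-bit-per-neighbor scheme from Edit 5 can be illustrated with a few lines of Python standing in for the kernel code. Bits 1, 2 and 4 for top, bottom and right come from the edit; the assignment of the remaining three bits and the helper names are assumptions.

```python
# One bit per direction. TOP=1, BOTTOM=2, RIGHT=4 follow Edit 5;
# the order of LEFT, FRONT and BACK is assumed for this example.
TOP, BOTTOM, RIGHT, LEFT, FRONT, BACK = (1 << i for i in range(6))


def pack_neighbors(present):
    """Build the mask from an iterable of the direction bits that are present."""
    mask = 0
    for bit in present:
        mask |= bit
    return mask


def has_neighbor(mask, direction_bit):
    # Parenthesise the AND: in C and OpenCL C, != binds tighter than &.
    return (mask & direction_bit) != 0


if __name__ == "__main__":
    mask = pack_neighbors([TOP, RIGHT])
    print(mask, has_neighbor(mask, TOP), has_neighbor(mask, BOTTOM))
```

All six flags fit in a single byte per element, compared with the 32 bytes taken by the int neighbors[8] array in the original struct.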
Judging by your step1 and step2, you are not making your GPU cores work hard. What is your kernel's complexity? What is your GPU usage? Did you check with a monitoring program such as Afterburner? Mid-range desktop gaming cards can handle 10k threads each doing 10k iterations.
Since you are working only with neighbours, the ratio of data size to calculation size is too high, and your kernels may be bottlenecked by VRAM bandwidth. Your main system RAM could be as fast as your PCIe bandwidth, and this could be the issue.
1) Use of Dedicated Cache: load your thread's own grid cell into private registers, which are the fastest storage, then load the neighbours into a __local array so that the comparisons and calculations are done entirely on chip.
Load current cell into __private
Load neighbours into __local
start looping for local array
get next neighbour into __private from __local
end loop
(if it has many neighbours, lines after "Load neighbours into __local" can be in another loop that gets from main memory by patches)
What is your GPU? Nice, it is a GTX660. You should have 64kB of controllable cache per compute unit. CPUs have only about 1kB of registers, and these are not addressable for array operations.
2) Shorter Indexing: use a single byte instead of an int as the stored neighbour index. Saving precious L1 cache space on "id" fetches is important so that other threads can hit the L1 cache more!
0=neighbour from left
1=neighbour from right
2=neighbour from up
3=neighbour from down
4=neighbour from front
5=neighbour from back
6=neighbour from upper left
so you can derive the neighbour index from a single byte instead of a 4-byte int, which reduces main memory traffic, at least for neighbour accesses. Your kernel derives the neighbour index from the table above using its compute power, not memory bandwidth, because this is done in core registers (__private). If your total grid size is constant, this is very easy: just add 1 to the actual cell id, or add 256, or add 256*256, and so on.
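Deriving the neighbour id from a one-byte direction code might look like the Python sketch below (standing in for kernel code). The mapping of codes 0 to 5 follows the table above, but which direction gets the negative offset, the function name, and the absence of any border handling are assumptions. Diagonal codes such as 6 are not covered.

```python
def neighbor_id(cell_id, direction, width, height):
    """Linear id of the neighbour in the given direction (codes 0..5)."""
    # Codes come in pairs: 0/1 step along x, 2/3 along y, 4/5 along z.
    stride = (1, width, width * height)[direction // 2]
    sign = -1 if direction % 2 == 0 else 1
    return cell_id + sign * stride


if __name__ == "__main__":
    # A 256 x 256 x N grid: offsets are 1, 256 and 256*256.
    for direction in range(6):
        print(direction, neighbor_id(100000, direction, 256, 256))
```

Only the direction code has to be read from memory; the arithmetic runs in registers.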
3) Optimum Object Size: make your struct/cell-object size a multiple of 4 bytes. If your total object size is around 200 bytes, you can pad it with some empty bytes to make it exactly 200 bytes, 220 bytes or 256 bytes.
4) Branchless Code (Edit: depends!): use fewer if-statements. An if-statement can make computation much slower, because lightweight cores are not as capable as heavyweight ones. Rather than checking for -1 as the end-of-neighbour marker, you can use another way: add surface-buffer cells that wrap the surface, so that computed cells always have 6 neighbours and you get rid of if (elem.neighbors[i] != -1). Worth a try, especially on a GPU.
Just computing all neighbours is faster than doing the if-statement. Multiply the resulting change by zero when it is not a valid neighbour. How can we know that it is not a valid neighbour? By using a byte array of 6 elements per cell, parallel to the neighbour id array (invalid=0, valid=1; multiply the result by this).
The if-statement is inside a loop that runs six times. Loop unrolling gives a similar speed-up if the workload in the loop is relatively light.
But if all threads within the same warp take the same if-or-else branch, they don't lose performance. So this depends on whether your code diverges or not.
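The multiply-by-validity idea from the branchless item can be shown with plain Python arithmetic. This only illustrates the data flow, not GPU performance, and the function names and sample values are hypothetical.

```python
def accumulate_branching(values, valid):
    """Reference version: skip invalid neighbours with an if-statement."""
    total = 0.0
    for value, flag in zip(values, valid):
        if flag:
            total += value
    return total


def accumulate_branchless(values, valid):
    """Same result without a branch: flag is 0 or 1 and scales the contribution."""
    total = 0.0
    for value, flag in zip(values, valid):
        total += value * flag
    return total


if __name__ == "__main__":
    contributions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    valid = [1, 0, 1, 1, 0, 1]
    print(accumulate_branching(contributions, valid),
          accumulate_branchless(contributions, valid))
```

One caveat: the slot of an invalid neighbour must still hold a finite number, because multiplying NaN or infinity by zero yields NaN rather than zero.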
5) Data Elements Reordering: you can move the int[8] member to the top of the struct so that memory access may become more efficient, and the smaller members below it can be read in a single read operation.
6) Size of Workgroup: trying different local workgroup sizes can give 2-3x performance. Sizes from 16 up to 512 give different results. For example, AMD GPUs like integer multiples of 64, while NVIDIA GPUs like integer multiples of 32. Intel does fine at 8 or anything above, since it can meld multiple compute units together to work on the same workgroup.
7) Separation of Variables (only if you can't get rid of the if-statements): separate the comparison elements from the struct. This way you don't need to load a whole struct from main memory just to compare an int or a boolean. The struct is loaded from main memory only when the comparison requires it (if you already have the local memory optimization, put this operation before it, so loading into local memory is only done for selected neighbours).
This optimisation makes the best case (no neighbour or only one neighbour) considerably faster. It does not affect the worst case (maximum number of neighbours).
8a) Magic: use shifts instead of dividing by a power of 2, and do similarly for modulo. Put "f" at the end of floating-point literals (1.0f instead of 1.0) to avoid automatic conversion from double to float.
8b) Magic-2: the -cl-mad-enable compiler option can increase the speed of multiply+add operations.
9) Latency Hiding: execution configuration optimization. You need to hide memory access latency and take care of occupancy.
Get maximum cycles of latency for instructions and global memory access.
Then divide memory latency by instruction latency.
Now you have the ratio of: arithmetic instruction number per memory access to hide latency.
If you need N instructions to hide memory latency and you have only M instructions in your code, then you will need N/M warps (wavefronts?) to hide the latency, because one thread on a GPU can do arithmetic while another thread is fetching from memory.
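The N/M rule of thumb above is simple enough to write down as a Python sketch. The function name and the cycle counts in the example are invented for illustration; real latencies have to come from the vendor documentation or a profiler.

```python
import math


def warps_to_hide_latency(mem_latency_cycles, instr_latency_cycles,
                          instrs_per_access):
    """Estimate the warps needed per the N/M rule described above."""
    # N: arithmetic instructions that fit into one memory access latency.
    needed_instrs = mem_latency_cycles / instr_latency_cycles
    # M instructions are available per access in the kernel, so N/M warps.
    return math.ceil(needed_instrs / instrs_per_access)


if __name__ == "__main__":
    # Hypothetical figures: 400-cycle memory access, 4-cycle arithmetic,
    # and 10 arithmetic instructions per memory access in the kernel.
    print(warps_to_hide_latency(400, 4, 10))
```

With those made-up figures N is 100 and M is 10, so about 10 warps would need to be in flight per scheduler for the arithmetic to cover the memory latency.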
10) Mixed Type Computing: after memory access is optimized, swap or move some instructions where applicable to get better occupancy, and use the half type to help floating-point operations where precision is not important.
11) Latency Hiding again: try your kernel code with only arithmetic (comment out all memory accesses and initialize the values with 0 or something you like), then try your kernel code with only memory access instructions (comment out the calculations and ifs).
Compare the kernel times with the original kernel time. Which one affects the original time more? Concentrate on that.
12) Lane & Bank Conflicts: correct any LDS lane conflicts and global memory bank conflicts, because accesses to the same address may be serialized, slowing the process (newer cards have a broadcast ability to reduce this).
13) Using registers: try to replace any independent locals with privates, since your GPU can give nearly 10TB/s throughput using registers.
14) Not Using Registers: don't use too many registers, or they will spill to global memory and slow the process.
15) Minimalistic Approach for Occupation: look at local/private usage to get an idea of occupancy. If you use many more locals and privates, fewer threads can run on the same compute unit, leading to lower occupancy. Less resource usage leads to a higher chance of full occupancy (if you have enough total threads).
16) Gather Scatter: when neighbours are different particles (like an n-body NNS) at random memory addresses, it may be hard to apply, but a gather-read optimization can give a 2x-3x speed-up on top of the previous optimizations (it needs the local memory optimization to work). It reads from memory in order instead of randomly, then reorders as needed in local memory to share (scatter) between threads.
17) Divide and Conquer: in case the buffer is too big and copying it between host and device makes the GPU wait idle, divide it in two, send the halves separately, start computing as soon as one arrives, and send the results back concurrently at the end. Even process-level parallelism could push a GPU to its limits this way. Also, the GPU's L2 cache may not be enough for the whole of the data. This is cache-tiled computing, but done implicitly instead of through direct use of local memory.
18) Bandwidth from memory qualifiers: when a kernel needs some extra 'read' bandwidth, you can use the '__constant' keyword (instead of __global) on parameters that are small and read-only. If those parameters are too large, you can still get good streaming from the '__read_only' qualifier (after the '__global' qualifier). Similarly, '__write_only' increases throughput, but these mostly give hardware-specific performance. If it is AMD's HD5000 series, constant is good. Maybe the GTX660 is faster with its cache, so __read_only may become more usable (or is Nvidia using cache for __constant?).
Have three parts of the same buffer, with one as __global __read_only, one as __constant and one as just __global (if building them doesn't cost more than the reads gain).
Just tested my card using AMD APP SDK examples, LDS bandwidth shows 2TB/s while constant is 5TB/s(same indexing instead of linear/random) and main memory is 120 GB/s.
Also don't forget to add restrict to kernel parameters where possible. This lets compiler do more optimizations on them(if you are not aliasing them).
19) Modern hardware transcendental functions are faster than old bit-hack versions (like the Quake 3 fast inverse square root).
20) Now there is OpenCL 2.0, which enables spawning kernels inside kernels, so you can further increase the resolution at a 2D grid point and offload it to a workgroup when needed (something like dynamically increasing vorticity detail on the edges of a fluid).
A profiler can help with all of these, but any FPS indicator will do if only a single optimization is applied per step.
Even if benchmarking is not meant for architecture-dependent code paths, you could try making the number of dots per row in your compute space a multiple of 192, since your GPU has a multiple of that number of cores, and benchmark whether that keeps the GPU more occupied and yields more GFLOPS.
There must still be some room for optimization after all these options, but I don't know whether it would damage your card or be feasible within the production time of your projects. For example:
21) Lookup tables: when there is 10% more memory bandwidth headroom but no compute power headroom, offload 10% of the workitems to a LUT version that gets precomputed values from a table. I didn't try it, but something like this should work:
8 compute groups
2 LUT groups
8 compute groups
2 LUT groups
so they are evenly distributed among the "threads in flight" and benefit from latency hiding. I'm not sure if this is a preferable way of doing science.
22) Z-order pattern: for traversing neighbours, this increases the cache hit rate. A higher cache hit rate saves some global memory bandwidth for other jobs, so overall performance increases. But this depends on the size of the cache, the data layout and some other things I don't remember.
23) Asynchronous Neighbor Traversal
iteration-1: Load neighbor 2 + compute neighbor 1 + store neighbor 0
iteration-2: Load neighbor 3 + compute neighbor 2 + store neighbor 1
iteration-3: Load neighbor 4 + compute neighbor 3 + store neighbor 2
so the body of each loop iteration has no dependency chain and is fully pipelined on the GPU processing elements. OpenCL also has special instructions for asynchronously loading/storing global variables using all cores of a workgroup. Check this:
Maybe you can even divide the computing part in two and have one part use transcendental functions and the other part use add/multiply, so that add/multiply operations don't wait for a slow sqrt. If there are at least several neighbours to traverse, this should hide some latency behind other iterations.
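For the Z-order pattern item above, the usual way to get such a layout is to store cells by their Morton code, which interleaves the bits of the three grid coordinates. The Python sketch below uses a plain bit loop for clarity; the function name and the 10-bit coordinate width are assumptions, and a kernel would more likely use a bit-twiddling variant.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z into one Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x takes bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)    # y takes bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)    # z takes bit positions 2, 5, 8, ...
    return code


if __name__ == "__main__":
    # The eight cells of a 2x2x2 block map to the indices 0..7.
    block = sorted(morton3(x, y, z)
                   for x in range(2) for y in range(2) for z in range(2))
    print(block)
```

Cells that are close in 3D tend to land close together in the buffer, which is where the improved cache hit rate would come from.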

When not to vectorize matlab?

I'm working on some matlab code which processes large (but not huge) datasets: 10,000 784-element vectors (not sparse), and calculates information about them which is stored in a 10,000x10 sparse matrix. In order to get the code working I did some of the trickier parts iteratively, looping over the 10k items to process them, and in a few places looping over the 10 items in the sparse matrix for cleanup.
My process initially took 73 iterations (so, on the order of 730k loops) to process, and ran in about 120 seconds. Not bad, but this is matlab, so I set out to vectorize it to speed it up.
In the end I have a fully vectorized solution which gets the same answer (so it's correct, or at least as correct as my initial solution), but takes 274 seconds to run; it's less than half as fast!
This is the first time I've run into matlab code which runs slower vectorized than it does iteratively. Are there any rules of thumb or best practices for identifying when this is likely / possible?
I'd love to share the code for some feedback, but it's for a currently open school assignment so I really can't right now. If it ends up being one of those "Wow, that's weird, you probably did something wrong" things, I'll probably revisit this in a week or two to see if my vectorization is somehow off.
Vectorisation in Matlab often means allocating a lot more memory (making a much larger array to avoid the loop, e.g. by Tony's trick). With the improved JIT compiling of loops in recent versions, it's possible that the memory allocation required for your vectorised solution means there is no advantage, but without seeing the code it's hard to say. Matlab has an excellent line-by-line profiler which should help you see which particular parts of the vectorised version are taking the time.
Have you tried plotting the execution time as a function of problem size (either the number of elements per vector [currently 784], or the number of vectors [currently 10,000])? I ran into a similar anomaly when vectorizing a Gram-Schmidt orthogonalization algorithm; it turned out that the vectorized version was faster until the problem grew to a certain size, at which point the iterative version actually ran faster, as seen in this plot:
Here are the two implementations and the benchmarking script:
function [Q,R] = clgs(A)
% QR factorization by unvectorized classical Gram-Schmidt orthogonalization
[m,n] = size(A);
R = zeros(n,n); % pre-allocate upper-triangular matrix
% iterate over columns
for j = 1:n
    v = A(:,j);
    % iterate over the previous, already orthonormalized columns
    for i = 1:j-1
        R(i,j) = A(:,i)' * A(:,j);
        v = v - R(i,j) * A(:,i);
    end
    R(j,j) = norm(v);
    A(:,j) = v / norm(v); % normalize
end
Q = A;

function [Q,R] = clgs2(A)
% QR factorization by classical Gram-Schmidt orthogonalization with a
% vectorized inner loop
[m,n] = size(A);
R = zeros(n,n); % pre-allocate upper-triangular matrix
for k=1:n
    R(1:k-1,k) = A(:,1:k-1)' * A(:,k);
    A(:,k) = A(:,k) - A(:,1:k-1) * R(1:k-1,k);
    R(k,k) = norm(A(:,k));
    A(:,k) = A(:,k) / R(k,k);
end
Q = A;

% benchmarking script
n = [300,350,400,450,500];
for i = 1:length(n)
    A = rand(n(i));
    tic; % start the timer that the toc below reads
    [Q,R] = clgs(A);
    clgs_time(i) = toc;
    tic;
    [Q,R] = clgs2(A);
    clgs2_time(i) = toc;
end
plot(n, clgs_time, n, clgs2_time) % plot call assumed; it is implied by the labels and legend
xlabel 'n', ylabel 'Time [seconds]'
legend('unvectorized CGS','vectorized CGS')
To answer the question "When not to vectorize MATLAB code" more generally:
Don't vectorize code if the vectorization is not straightforward and makes the code very hard to read. This is under the assumption that:
People other than you might need to read and understand it.
The unvectorized code is fast enough for what you need.
This won't be a very specific answer, but I deal with extremely large datasets (4D cardiac datasets).
There are occasions where I need to perform an operation that involves a number of 4D sets. I can either create a loop, or a vectorised operation that essentially works on a concatenated 5D object. (e.g. as a trivial example, say you wanted to get the average 4D object, you could either create a loop collecting a walking-average, or concatenate in the 5th dimension, and use the mean function over it).
In my experience, even setting aside the time it takes to create the 5D object in the first place, it is usually a lot faster to resort to a loop over the still large, but a lot more manageable, 4D objects, presumably due to the sheer size and the memory access leaps involved when performing the calculations.
The other "micro-optimisation" trick I will point out is that matlab uses "column-major order". Meaning, for my trivial example, I believe it would be faster to average along the 1st dimension rather than the 5th one, as the former involves contiguous locations in memory, whereas the latter involves huge jumps, so to speak. So it may be worth storing your mega-array in a dimension order that has the data you'll be operating on as the first dimension, if that makes sense.
Trivial example to show the difference between operating on rows vs columns:
>> A = randn(10000,10000);
>> tic; for n = 1 : 100; sum(A,1); end; toc
Elapsed time is 12.354861 seconds.
>> tic; for n = 1 : 100; sum(A,2); end; toc
Elapsed time is 22.298909 seconds.
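The difference between the two timings comes down to how far apart consecutive elements sit in memory. The short Python sketch below uses a hypothetical helper with 0-based indices to compute the linear offset of an element in a column-major array. It demonstrates the address arithmetic only, not MATLAB's actual implementation.

```python
def column_major_offset(index, shape):
    """Linear offset (in elements) of a 0-based index in column-major order."""
    offset, stride = 0, 1
    for i, dim in zip(index, shape):
        offset += i * stride   # the first dimension has stride 1
        stride *= dim          # each later dimension jumps over all earlier ones
    return offset


if __name__ == "__main__":
    shape = (10000, 10000)
    # Moving down a column (first index) steps by one element ...
    print(column_major_offset((1, 0), shape) - column_major_offset((0, 0), shape))
    # ... while moving along a row (second index) jumps a whole column.
    print(column_major_offset((0, 1), shape) - column_major_offset((0, 0), shape))
```

sum(A,1) walks each column through adjacent elements, while sum(A,2) has to combine elements that are 10,000 positions apart, which fits the roughly 2x gap measured above.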