I have many matrices to add. Let's say that the matrices are [M1, M2, ..., M_n]. Then a simple way is
X = np.zeros(M1.shape)
for M in matrices:
    X += M
In the operation X += M, does Python allocate new memory for X every time += is executed? If so, that seems inefficient. Is there any way of doing an in-place operation without allocating new memory for X?
This works but is not faster on my machine:
numpy.sum(matrices, axis=0)
Unless you get a MemoryError, trying to second-guess memory usage in numpy is not worth the effort. Leave that to the developers who know the compiled code.
But we can perform some time tests - that's what really matters, isn't it?
I'll test adding a good size array 100 times.
In [479]: M=np.ones((1000,1000))
Your iterative approach with +=
In [480]: %%timeit
...: X=np.zeros_like(M)
...: for _ in range(100): X+=M
...:
1 loop, best of 3: 627 ms per loop
Or make an array of size (100, 1000, 1000) and apply np.sum across the first axis.
In [481]: timeit np.sum(np.array([M for _ in range(100)]),axis=0)
1 loop, best of 3: 1.54 s per loop
Or use the np.add ufunc: with reduce we can apply it sequentially to all arrays in the list.
In [482]: timeit np.add.reduce([M for _ in range(100)])
1 loop, best of 3: 1.53 s per loop
The np.sum case gives me a MemoryError if I use range(1000). I don't have enough memory to hold a (1000,1000,1000) array. Same for the add.reduce, which builds an array from the list.
What += does under the covers is normally hidden, and of no concern to us - usually. But for a peek under the covers, look at ufunc.at: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.at.html#numpy.ufunc.at
Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So X+=M does write the sum to a buffer, and then copies that buffer to X. There is a temporary buffer, but final memory usage does not change.
But that buffer creation and copying is done in fast C code.
np.add.at was added to deal with the case where that buffered action creates some problems (duplicate indices).
So it avoids that temporary buffer - but at a considerable speed cost. It's probably the added indexing capability that slows it down. (There may be a fairer add.at test; but it certainly doesn't help in this case.)
In [491]: %%timeit
...: X=np.zeros_like(M)
...: for _ in range(100): np.add.at(X,(slice(None),slice(None)),M)
1 loop, best of 3: 19.8 s per loop
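To answer the original question directly: X += M already updates X in place (no new array object is created for X), and the explicit spelling of the same operation is the out= argument of the ufunc. A minimal sketch, mirroring the timing loop above:
X = np.zeros_like(M)
for _ in range(100):
    np.add(X, M, out=X)    # writes the result back into X; same effect as X += M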
Related
I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:], z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1, i2, r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day) -- presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
arr /= z.shape[1]   # normalize by the number of observations per row (60), not by the number of rows
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
hf.create_dataset("correlation", data=arr)
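To then write out just the coordinates and values above the 0.95 threshold, as asked in the question, a row-by-row scan of the in-memory matrix keeps the extra memory small. A sketch (the output filename is my own choice):
with open('high_corr.txt', 'w') as out:
    for i in range(arr.shape[0]):
        # look only at j > i to skip the diagonal and duplicate pairs
        js = np.nonzero(arr[i, i+1:] > 0.95)[0] + i + 1
        for j in js:
            out.write("%6d %6d %.3f\n" % (i, j, arr[i, j]))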
From my rough calculations, you want a correlation matrix with 100,000^2 = 10^10 elements. That takes up around 40 GB of memory with 32-bit floats (80 GB with 64-bit).
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero-mean data the covariance can be found using a dot product.
z0 = z - np.mean(z, axis=1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation matrix by normalizing with the standard deviations:
sigma = np.std(z, axis=1)
corr = cov            # same array; the in-place divisions below overwrite it
corr /= sigma
corr /= sigma[:, None]
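On a small array you can check that this reproduces np.corrcoef; a quick sanity-check sketch (not part of the original answer):
z_test = np.random.rand(5, 60)
z0_test = z_test - np.mean(z_test, axis=1)[:, None]
cov_test = np.dot(z0_test, z0_test.T) / z_test.shape[-1]
sigma_test = np.std(z_test, axis=1)
corr_test = cov_test / sigma_test / sigma_test[:, None]
assert np.allclose(corr_test, np.corrcoef(z_test))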
Of course memory usage is still an issue.
You can work around this with memory-mapped arrays (make sure they are opened for reading and writing) and the out parameter of np.dot (for another example see Optimizing my large data code with little RAM):
N = z.shape[0]
z0 = z0.astype(np.float32)   # np.dot requires `out` to have the dtype of the result
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N, N))
np.dot(z0, z0.T, out=arr)
arr /= sigma
arr /= sigma[:, None]
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with np.where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory.)
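One way around that giant boolean array is to scan the memmap one block of rows at a time; a sketch (the block size and output filename are arbitrary choices of mine):
block = 1000    # rows per chunk; tune to available RAM
with open('high_corr.txt', 'w') as out:
    for start in range(0, N, block):
        chunk = arr[start:start + block]       # one block of rows of the memmap
        rows, cols = np.nonzero(chunk > 0.95)
        for r, c in zip(rows, cols):
            if start + r < c:                  # upper triangle only
                out.write("%6d %6d %.3f\n" % (start + r, c, chunk[r, c]))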
You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e9 terms, which will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute-forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation') and then keeping only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
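A rough sketch of that blocked approach, assuming z is the (100000, 60) observation array from the question (block size and bookkeeping are my own choices):
import numpy as np
from scipy.spatial.distance import cdist

block = 1000
hits = []    # (i, j, correlation) triples
for i in range(0, z.shape[0], block):
    for j in range(i, z.shape[0], block):
        # correlation distance = 1 - correlation, so corr > 0.95 means distance < 0.05
        d = cdist(z[i:i + block], z[j:j + block], metric='correlation')
        rows, cols = np.nonzero(d < 0.05)
        for r, c in zip(rows, cols):
            if i + r < j + c:    # keep each pair once and skip the diagonal
                hits.append((i + r, j + c, 1.0 - d[r, c]))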
Please check out the deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it with z.shape = (2500, 60), computing the Pearson r for all 2500 x 2500 pairs, and it was extremely fast.
Not sure about 100000 x 100000, but it is worth trying.
I have a list of sorted numpy arrays. What is the most efficient way to compute the sorted intersection of these arrays?
In my application, I expect the number of arrays to be less than 10^4, I expect the individual arrays to be of length less than 10^7, and I expect the length of the intersection to be close to p*N, where N is the length of the largest array and where 0.99 < p <= 1.0. The arrays are loaded from disk and can be loaded in batches if they won't all fit in memory at once.
A quick and dirty approach is to repeatedly invoke numpy.intersect1d(). That seems inefficient though as intersect1d() does not take advantage of the fact that the arrays are sorted.
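For reference, that quick and dirty version is essentially a one-line reduce; a sketch, where arrays stands for the list of sorted arrays:
from functools import reduce
import numpy as np

result = reduce(np.intersect1d, arrays)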
Since intersect1d sorts its arrays each time, it is indeed inefficient.
Here you have to sweep the current intersection and each sample together to build the new intersection, which can be done in linear time while maintaining order.
Such a task often has to be tuned by hand with low-level routines.
Here is a way to do that with numba:
from numba import njit
import numpy as np
@njit
def drop_missing(intersect, sample):
    i = j = k = 0
    new_intersect = np.empty_like(intersect)
    while i < intersect.size and j < sample.size:
        if intersect[i] == sample[j]:  # the 99% case
            new_intersect[k] = intersect[i]
            k += 1
            i += 1
            j += 1
        elif intersect[i] < sample[j]:
            i += 1
        else:
            j += 1
    return new_intersect[:k]
Now the samples:
n = 10**7
ref = np.random.randint(0, n, n)
ref.sort()

def perturbation(sample, k):
    rands = np.random.randint(0, n, k-1)
    rands.sort()
    l = np.split(sample, rands)
    return np.concatenate([a[:-1] for a in l])

samples = [perturbation(ref, 100) for _ in range(10)]  # similar samples
And a run for 10 samples:
def find_intersect(samples):
    intersect = samples[0]
    for sample in samples[1:]:
        intersect = drop_missing(intersect, sample)
    return intersect
In [18]: %time u=find_intersect(samples)
Wall time: 307 ms
In [19]: len(u)
Out[19]: 9999009
At this rate (307 ms for 10 samples), the full job of up to 10^4 samples should take about 5 minutes, not counting loading time.
A few months ago, I wrote a C++-based python extension for this exact purpose. The package is called sortednp and is available via pip. The intersection of multiple sorted numpy arrays, for example, a, b and c, can be calculated with
import sortednp as snp
i = snp.kway_intersect(a, b, c)
By default, this uses an exponential search to advance the array indices internally which is pretty fast in cases where the intersection is small. In your case, it might be faster if you add algorithm=snp.SIMPLE_SEARCH to the method call.
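Following the suggestion above, the call would then look like:
i = snp.kway_intersect(a, b, c, algorithm=snp.SIMPLE_SEARCH)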
Let's say I take a stream of incoming data (arriving very fast) and I want to compute various stats (e.g. standard deviation) over a window of, say, the last N samples, with N being quite large. What's the most efficient way to do this with Python?
For example,
df = pd.DataFrame(np.random.random_sample(200000000))
df2 = df.append([5])
is crashing my REPL environment in Visual Studio.
Is there a way to append to an array without this happening? Is there a way to tell which operations on the dataframe are computed incrementally, other than by doing timeit on them?
I recommend building a circular buffer out of a numpy array.
This will involve keeping track of an index to your last updated point, and incrementing that when you add a new value.
import numpy as np
circular_buffer = np.zeros(200000000, dtype=np.float64)
head = 0
# stream input
circular_buffer[head] = new_value
if head == len(circular_buffer) - 1:
    head = 0
else:
    head += 1
Then you can compute the statistics normally on circular_buffer.
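For example, a small sketch (once the buffer has wrapped around at least once, every slot holds a recent sample):
window_mean = np.mean(circular_buffer)
window_std = np.std(circular_buffer)

# before the first wrap-around, only the filled prefix is meaningful
partial_std = np.std(circular_buffer[:head])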
If You Don't Have Enough RAM
Try implementing something similar in bquery and bcolz. These store your data more efficiently than numpy (using compression) and offer similar performance. Bquery now has mean and standard deviation.
Note: I'm a contributor to bquery
I have a large numpy array that I am going to take a linear projection of using randomly generated values.
>>> input_array.shape
(50, 200000)
>>> random_array = np.random.normal(size=(200000, 300))
>>> output_array = np.dot(input_array, random_array)
Unfortunately, random_array takes up a lot of memory, and my machine starts swapping. It seems to me that I don't actually need all of random_array around at once; in theory, I ought to be able to generate it lazily during the dot product calculation...but I can't figure out how.
How can I reduce the memory footprint of the calculation of output_array from input_array?
This obviously isn't the fastest solution, but have you tried:
m, inner = input_array.shape
n = 300
out = np.empty((m, n))
for i in range(n):
    out[:, i] = np.dot(input_array, np.random.normal(size=inner))
This might be a situation where using Cython could reduce your memory usage. You could generate the random numbers on the fly and accumulate the result as you go. I don't have the time to write and test the full function, but you would definitely want to use randomkit (the library that numpy uses under the hood) at the C level.
You can take a look at some example code I wrote for another application to see how to wrap randomkit:
https://github.com/synapticarbors/pylangevin-integrator/blob/master/cIntegrator.pyx
And also check out how matrix multiplication is implemented in the following paper on cython:
http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf
Instead of having both arrays as inputs, just have input_array as one, and then in the method, generate small chunks of the random array as you go.
Sorry if it is just a sketch instead of actual code, but hopefully it is enough to get you started.
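For what it's worth, the same chunking idea can be sketched in plain NumPy without Cython: generate the random matrix one row-block at a time and accumulate the partial products. The function name, chunk size, and seeding below are my own choices, not from the answer:
import numpy as np

def random_projection_chunked(input_array, n_components=300, chunk=10000, seed=0):
    # equivalent to np.dot(input_array, R) with R of shape (inner, n_components),
    # but R is generated and discarded one row-block at a time
    rng = np.random.RandomState(seed)
    m, inner = input_array.shape
    out = np.zeros((m, n_components))
    for start in range(0, inner, chunk):
        stop = min(start + chunk, inner)
        block = rng.normal(size=(stop - start, n_components))
        out += np.dot(input_array[:, start:stop], block)
    return out

output_array = random_projection_chunked(input_array)    # input_array has shape (50, 200000)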
I'm working on some matlab code which is processing large (but not huge) datasets: 10,000 784-element vectors (not sparse), and calculating information about them which is stored in a 10,000x10 sparse matrix. In order to get the code working I did some of the trickier parts iteratively, with loops over the 10k items to process them, and a loop over the 10 items in the sparse matrix for cleanup.
My process initially took 73 iterations (so, on the order of 730k loops) to process, and ran in about 120 seconds. Not bad, but this is matlab, so I set out to vectorize it to speed it up.
In the end I have a fully vectorized solution which gets the same answer (so it's correct, or at least as correct as my initial solution), but it takes 274 seconds to run - less than half the speed!
This is the first time I've run into matlab code which runs slower vectorized than it does iteratively. Are there any rules of thumb or best practices for identifying when this is likely or possible?
I'd love to share the code for some feedback, but it's for a currently open school assignment so I really can't right now. If it ends up being one of those "Wow, that's weird, you probably did something wrong" things, I'll probably revisit this in a week or two to see if my vectorization is somehow off.
Vectorisation in Matlab often means allocating a lot more memory (making a much larger array to avoid the loop, e.g. by Tony's trick). With the improved JIT compiling of loops in recent versions, it's possible that the memory allocation required for your vectorised solution means there is no advantage, but without seeing the code it's hard to say. Matlab has an excellent line-by-line profiler which should help you see which particular parts of the vectorised version are taking the time.
Have you tried plotting the execution time as a function of problem size (either the number of elements per vector [currently 784], or the number of vectors [currently 10,000])? I ran into a similar anomaly when vectorizing a Gram-Schmidt orthogonalization algorithm; it turned out that the vectorized version was faster until the problem grew to a certain size, at which point the iterative version actually ran faster, as seen in this plot:
Here are the two implementations and the benchmarking script:
clgs.m
function [Q,R] = clgs(A)
% QR factorization by unvectorized classical Gram-Schmidt orthogonalization

[m,n] = size(A);
R = zeros(n,n);             % pre-allocate upper-triangular matrix

% iterate over columns
for j = 1:n
    v = A(:,j);
    % iterate over the previously processed columns
    for i = 1:j-1
        R(i,j) = A(:,i)' * A(:,j);
        v = v - R(i,j) * A(:,i);
    end
    R(j,j) = norm(v);
    A(:,j) = v / norm(v);   % normalize
end
Q = A;
clgs2.m
function [Q,R] = clgs2(A)
% QR factorization by classical Gram-Schmidt orthogonalization with a
% vectorized inner loop

[m,n] = size(A);
R = zeros(n,n);             % pre-allocate upper-triangular matrix

for k = 1:n
    R(1:k-1,k) = A(:,1:k-1)' * A(:,k);
    A(:,k) = A(:,k) - A(:,1:k-1) * R(1:k-1,k);
    R(k,k) = norm(A(:,k));
    A(:,k) = A(:,k) / R(k,k);
end
Q = A;
benchgs.m
n = [300,350,400,450,500];
clgs_time = zeros(length(n),1);
clgs2_time = clgs_time;

for i = 1:length(n)
    A = rand(n(i));

    tic;
    [Q,R] = clgs(A);
    clgs_time(i) = toc;

    tic;
    [Q,R] = clgs2(A);
    clgs2_time(i) = toc;
end

semilogy(n,clgs_time,'b',n,clgs2_time,'r')
xlabel 'n', ylabel 'Time [seconds]'
legend('unvectorized CGS','vectorized CGS')
To answer the question "When not to vectorize MATLAB code" more generally:
Don't vectorize the code if the vectorization is not straightforward and makes the code very hard to read. This is under the assumption that
People other than you might need to read and understand it.
The unvectorized code is fast enough for what you need.
This won't be a very specific answer, but I deal with extremely large datasets (4D cardiac datasets).
There are occasions where I need to perform an operation that involves a number of 4D sets. I can either create a loop, or a vectorised operation that essentially works on a concatenated 5D object. (e.g. as a trivial example, say you wanted to get the average 4D object, you could either create a loop collecting a walking-average, or concatenate in the 5th dimension, and use the mean function over it).
In my experience, putting aside the time it takes to create the 5D object in the first place, it is usually a lot faster to resort to a loop over the still-large but far more manageable 4D objects - presumably because of the sheer size and the memory-access leaps involved in the 5D calculation.
The other "micro-optimisation" trick I will point out is that matlab uses column-major order. Meaning, for my trivial example, I believe it would be faster to average along the 1st dimension rather than the 5th one, as the former involves contiguous locations in memory whereas the latter involves huge jumps, so to speak. So it may be worth storing your mega-array in a dimension order that puts the data you operate on most often in the first dimension, if that makes sense.
Trivial example to show the difference between operating on rows vs columns:
>> A = randn(10000,10000);
>> tic; for n = 1 : 100; sum(A,1); end; toc
Elapsed time is 12.354861 seconds.
>> tic; for n = 1 : 100; sum(A,2); end; toc
Elapsed time is 22.298909 seconds.