I want to use numpy to build a collection dictionary for some statistical objects; the simplified setup is as follows.
There is a 1D array of scalars,
a = np.array([n1, n2, n3, ...])
and a 2D array,
b = np.array([[q1_1, q1_2], [q2_1, q2_2], [q3_1, q3_2], ...])
For each element ni in a, I want to pick out all the rows qi = [qi_1, qi_2] of b that contain ni, and collect them in a dict keyed by ni.
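For instance (a small made-up illustration, not my actual data), the desired result would look like this:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2], [2, 3]])
# desired: {1: [[1, 2]], 2: [[1, 2], [2, 3]], 3: [[2, 3]]}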
I have written a clumsy method for this purpose (assume that a and b are already given) in the following code:
import numpy as np

a = np.array([i + 1 for i in range(100)])
b = np.array([[2*i + 1, 2*(i + 1)] for i in range(50)])
d = {}
for i in a:
    d[i] = [j for j in b if i in j]
There is no doubt that when a and b are large, this will be very slow.
Is there a more efficient way to do this?
Thanks for your help!
Thanks for your idea; it completely solves my problem. The core concept is to compare a and b and get a Boolean array as the result, so that indexing the array b with this Boolean mask to build the dictionary is much faster. Following this idea, I rewrote your code in my own way as follows:
d = {}
for item in a:
    # Boolean masks: rows of b whose first / second column equals item
    index_left, index_right = (b[:, 0] == item), (b[:, 1] == item)
    index = np.logical_or(index_left, index_right)
    d[item] = b[index]
This code is still not faster than yours, but it avoids the memory error even for large a and b (e.g. len(a) = 100000 and len(b) = 200000).
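As a side note, the same per-item masking can also be written more compactly; (b == item).any(axis=1) is equivalent to OR-ing the two column comparisons (a sketch of the identical idea, not a different method):
d = {item: b[(b == item).any(axis=1)] for item in a}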
Numpy arrays allow elementwise comparison:
equal = b[:, :, np.newaxis] == a   # np.newaxis to broadcast against a
# if one of the two entries is equal, we include this row
index = np.logical_or(equal[:, 0], equal[:, 1])
# indexing b with a Boolean array gives the result
dictionary = {i: b[index[:, i]] for i in range(len(a))}
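If you want the keys to be the values ni themselves rather than their positions in a, as the question asks, and assuming the entries of a are unique, you can key the comprehension by a[i] instead (a small variation on the line above):
dictionary = {a[i]: b[index[:, i]] for i in range(len(a))}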
As a final remark: are you sure you want to use a dictionary? With it you lose many of the advantages of numpy.
Edit, answer to your comment:
With a and b this large, equal will have 200000 * 2 * 100000 = 4*10^10 boolean elements (one byte each), i.e. roughly 40 GB. That's why you get this error.
The main question you should ask is: do I really need arrays this big? If yes, are you sure that the dictionary will not become too large as well?
The problem can be solved by not computing everything at once but in n chunks; n should be at least the ratio of that size to your available memory (for example about 40/16 ≈ 3 with 16 GB of RAM). Making n a little larger will probably speed up the process:
n = 5                                   # number of chunks; example value, choose as discussed above
stride = -(-len(b) // n)                # ceiling division so the last chunk gets the remainder
dictionary = {i: [] for i in range(len(a))}
for c in range(n):
    # splitting b into several parts so that `equal` stays small
    chunk = b[c*stride:(c + 1)*stride]
    equal = chunk[:, :, np.newaxis] == a
    index = np.logical_or(equal[:, 0], equal[:, 1])
    for i in range(len(a)):
        dictionary[i].append(chunk[index[:, i]])
dictionary = {i: np.concatenate(parts) for i, parts in dictionary.items()}
Related
I have multiple large images stored in binary (FITS) files on disk. Each array has the same shape and dtype.
I need to read in N of these images but wish to preserve memory-mapping, as together they would swamp RAM. The easiest way to do this is, of course, to read them in as elements of a list. However, ideally I would like to treat them as a single numpy array (of shape [n, ny, nx]), e.g. for easy transposing etc.
Is this possible without reading them into RAM?
Note: in practice, what I need is more complicated, equivalent to reading in a list of lists (e.g. an M-element list, each element itself an N-element list of ndarray images), but an answer to the simple case above should hopefully be sufficient.
Thanks for any help.
You can either write a fairly complex abstraction that presents an array-like interface over multiple files, or you can consolidate your data. The former is probably not worth your time.
Consolidating the data, e.g. in a temporary file, is a much simpler option, which I've implemented here with the assumption that you are using astropy for your FITS I/O. You can tailor it for other libraries or other use-cases as you see fit.
from tempfile import TemporaryFile

import numpy as np
from astropy.io import fits

n = 0
with TemporaryFile() as output:
    for filename in my_list_of_files:
        with fits.open(filename) as hdus:
            # If you have a single HDU that you know how to reference, get rid of the loop
            for hdu in hdus:
                if isinstance(hdu, fits.ImageHDU):
                    data = hdu.data.T
                    if n == 0:
                        shape = data.shape
                        dtype = data.dtype
                    elif data.shape != shape or data.dtype != dtype:
                        continue
                    data.tofile(output)
                    n += 1
Now you have a single binary flatfile with all your data in row-major order, and all the metadata you need to use numpy's memmap:
array = np.memmap(output, dtype, shape=(n,) + shape)
Do all your work inside the outer with block, since output will be deleted on close in this implementation.
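For example (a minimal sketch of what the memmap then supports; the operations below are illustrative, not from the original answer):
first_image = array[0]                 # only the pages for this slice are read from disk
stack_T = array.transpose(0, 2, 1)     # a view; no data is copied
mean_image = array.mean(axis=0)        # a reduction; this does touch all the data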
I have a list of sorted numpy arrays. What is the most efficient way to compute the sorted intersection of these arrays?
In my application, I expect the number of arrays to be less than 10^4, I expect the individual arrays to be of length less than 10^7, and I expect the length of the intersection to be close to p*N, where N is the length of the largest array and where 0.99 < p <= 1.0. The arrays are loaded from disk and can be loaded in batches if they won't all fit in memory at once.
A quick and dirty approach is to repeatedly invoke numpy.intersect1d(). That seems inefficient though as intersect1d() does not take advantage of the fact that the arrays are sorted.
Since intersect1d sorts the arrays each time, it is indeed inefficient here.
Instead, you can sweep the current intersection and each sample together to build the new intersection, which can be done in linear time while maintaining sorted order.
Such a task must often be tuned by hand with low-level routines.
Here is a way to do that with numba:
from numba import njit
import numpy as np

@njit
def drop_missing(intersect, sample):
    """Keep only the elements of `intersect` that also appear in `sample`."""
    i = j = k = 0
    new_intersect = np.empty_like(intersect)
    while i < intersect.size and j < sample.size:
        if intersect[i] == sample[j]:      # the 99% case
            new_intersect[k] = intersect[i]
            k += 1
            i += 1
            j += 1
        elif intersect[i] < sample[j]:
            i += 1
        else:
            j += 1
    return new_intersect[:k]
Now the samples:
n = 10**7
ref = np.random.randint(0, n, n)
ref.sort()

def perturbation(sample, k):
    # drop the last element of each of k random chunks of the sample
    rands = np.random.randint(0, n, k - 1)
    rands.sort()
    parts = np.split(sample, rands)
    return np.concatenate([part[:-1] for part in parts])

samples = [perturbation(ref, 100) for _ in range(10)]  # similar samples
And a run for the 10 samples:
def find_intersect(samples):
    intersect = samples[0]
    for sample in samples[1:]:
        intersect = drop_missing(intersect, sample)
    return intersect
In [18]: %time u=find_intersect(samples)
Wall time: 307 ms
In [19]: len(u)
Out[19]: 9999009
At this rate, the full job (up to ~10^4 arrays) should take about 5 minutes, not counting loading time.
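Since the arrays come from disk, the same loop works when loading them one at a time as well (a minimal sketch, assuming each sample is stored sorted in a .npy file; the file names here are hypothetical):
import numpy as np

filenames = ["sample_0000.npy", "sample_0001.npy"]   # hypothetical paths
intersect = np.load(filenames[0])
for fname in filenames[1:]:
    intersect = drop_missing(intersect, np.load(fname))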
A few months ago, I wrote a C++-based python extension for this exact purpose. The package is called sortednp and is available via pip. The intersection of multiple sorted numpy arrays, for example, a, b and c, can be calculated with
import sortednp as snp
i = snp.kway_intersect(a, b, c)
By default, this uses an exponential search to advance the array indices internally, which is pretty fast when the intersection is small. In your case, where the intersection retains most elements, it might be faster if you add algorithm=snp.SIMPLE_SEARCH to the method call.
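That is, a one-line sketch of the suggested call:
i = snp.kway_intersect(a, b, c, algorithm=snp.SIMPLE_SEARCH)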
Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim = 1. I understand why this happens, but I'm interested in whether there is any way to trick numpy into viewing m as 2D. One idea would be to pad with a dummy value so that the rows become equally sized, but I have a lot of such matrices and that would take up too much space. The reason I really need m to be 2D is that I am working with Theano, and the tensor that will be given the value of m expects a 2D value.
Here is some very recent information about Theano: we have a new TypedList() type that allows a Python list whose elements all share the same type, e.g. 1D ndarrays. Everything is done except the documentation.
There is only limited functionality you can use with them, but we made them to allow looping over a typed list with scan. They are not yet fully integrated with scan, but you can already use them like this:
import theano
import theano.typed_list

a = theano.typed_list.TypedListType(theano.tensor.fvector)()
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
                   non_sequences=[a],
                   sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.
No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a (pointer, itemsize, strides) triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
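To illustrate (a small sketch; note that recent NumPy versions require dtype=object explicitly for ragged input):
import numpy as np

m = np.array([[1, 2, 3], [4, 5]], dtype=object)
print(m.ndim, m.dtype)   # 1 object -- a 1-D array holding two Python lists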
I have a large dataset of compound data in an HDF5 file. The type of the compound data looks as follows:
numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)),
             ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])
With that I create a dataset with references to an image and another dataset at each position.
These datasets have the dimensions n x n, with n typically at least 256, but more likely >2000.
I have to initially fill each position of these datasets with the same value:
[[(image.ref, dataset.ref), ..., (image.ref, dataset.ref)],
 ...
 [(image.ref, dataset.ref), ..., (image.ref, dataset.ref)]]
I am trying to avoid filling it with two for-loops like:
for i in xrange(0, n):
    for j in xrange(0, n):
        dset[i, j] = (image.ref, dataset.ref)   # dset is the big n x n dataset being filled
because the performance is very bad.
So I'm searching for something like numpy.fill, numpy.shape, numpy.reshape, numpy.array, numpy.arange, [:] and so on. I tried those functions in various ways, but they all seem to work only with numeric and string datatypes.
Is there any way to fill these datasets faster than with the for-loops?
Thank you in advance.
You can use either numpy broadcasting or a combination of numpy.repeat and numpy.reshape:
import numpy
import h5py

my_dtype = numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)),
                        ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])
ref_array = numpy.array((image.ref, dataset.ref), dtype=my_dtype)
data = numpy.repeat(ref_array, n*n)   # flattened array of n*n copies of the reference pair
data = data.reshape((n, n))
Note that numpy.repeat returns a flattened array, hence the use of numpy.reshape. It seems repeat is faster than just broadcasting it:
%timeit empty_dataset=np.empty(2*2,dtype=my_dtype); empty_dataset[:]=ref_array
100000 loops, best of 3: 9.09 us per loop
%timeit repeat_dataset=np.repeat(ref_array, 2*2).reshape((2,2))
100000 loops, best of 3: 5.92 us per loop
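Once you have the n x n array, it can also be written into the HDF5 dataset in a single bulk assignment instead of element by element (a minimal sketch; h5_dset is a hypothetical name for the n x n HDF5 dataset created with my_dtype, and data is the repeated array built above):
h5_dset[...] = data   # one bulk write instead of n*n individual assignments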
Many array methods return a single index despite the fact that the array is multidimensional. For example:
a = np.random.rand(2, 3)
z = a.argmax()
For two dimensions, it is easy to find the matrix indices of the maximum element:
a[z // 3, z % 3]
But for more dimensions, it can become annoying. Does Numpy/Scipy have a simple way of returning the indices in multiple dimensions given an index in one (collapsed) dimension? Thanks.
Got it!
idx = X.argmax()
i, j = np.unravel_index(idx, X.shape)
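The same pattern works for any number of dimensions, which addresses the annoyance mentioned above (a small sketch with made-up shapes):
import numpy as np

X = np.random.rand(2, 3, 4)
idx = np.unravel_index(X.argmax(), X.shape)   # a tuple of indices, e.g. (1, 0, 2)
assert X[idx] == X.max()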
I don't know of a built-in function that does what you want, but when this has come up for me, I realized that what I really wanted to do was this:
given two arrays a and b with the same shape, find the element of b which is in the same position (same [i, j, k, ...] position) as the maximum element of a.
For this, the quick numpy-ish solution is:
j = a.flatten().argmax()
corresponding_b_element = b.flatten()[j]
Vince Marchetti