Creating large ndarray from multiple mem-mapped arrays - numpy

I have multiple large images stored as binary (FITS) files on disk. Each array has the same shape and dtype.
I need to read in N of these images, but want to preserve memory-mapping, since together they would swamp RAM. The easiest way to do this is, of course, to read them in as elements of a list. Ideally, however, I would like to treat them as a single numpy array (of shape [n, ny, nx]), e.g. for easy transposition.
Is this possible without reading them into RAM?
Note: in practice, what I need is more complicated, equivalent to reading in a list of lists (e.g. an M-element list, each element itself an N-element list of ndarray images), but an answer to the simple case above should hopefully be sufficient.
Thanks for any help.

You can either build an abstraction that presents an array-like interface over multiple files, or you can consolidate your data. The former is going to be fairly complex and probably not worth your time.
Consolidating the data, e.g. in a temporary file, is a much simpler option, which I've implemented here with the assumption that you are using astropy for your FITS I/O. You can tailor it for other libraries or other use-cases as you see fit.
from tempfile import TemporaryFile
from astropy.io import fits

n = 0
with TemporaryFile() as output:
    for filename in my_list_of_files:
        with fits.open(filename) as hdus:
            # If you have a single HDU that you know how to reference, get rid of the loop
            for hdu in hdus:
                if isinstance(hdu, fits.ImageHDU):
                    data = hdu.data.T
                    if n == 0:
                        shape = data.shape
                        dtype = data.dtype
                    elif data.shape != shape or data.dtype != dtype:
                        continue
                    data.tofile(output)
                    n += 1
Now you have a single binary flatfile with all your data in row-major order, and all the metadata you need to use numpy's memmap:
array = np.memmap(output, dtype, shape=(n,) + shape)
Do all your work inside the outer with block, since output will be deleted on close in this implementation.
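For example, a minimal usage sketch (assuming you are still inside the with block above, so output has not yet been closed):

import numpy as np

# still inside `with TemporaryFile() as output:` from above
array = np.memmap(output, dtype, shape=(n,) + shape)
first_image = array[0]               # a view into the file; nothing is read yet
swapped = array.transpose(0, 2, 1)   # also a view, no copy is made
print(first_image.mean())            # only this slice's data is read, on demand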

Related

use numpy to make a collection dictionary for some statistical objects

I want to use numpy to build a dictionary collecting some statistical objects; the simplified setup is as follows.
There is a 1D array of scalars,
a = np.array([n1,n2,n3...])
and a 2D array,
b = np.array([[q1_1,q1_2],[q2_1,q2_2],[q3_1,q3_2]...])
For each element ni in a, I want to pick out every row qi = [qi_1, qi_2] of b that contains ni, and collect those rows in a dict under the key ni.
I have written a clumsy method for this purpose (assume a and b are given) in the following code:
import numpy as np
a = np.array([i+1 for i in range(100)])
b = np.array([[2*i+1,2*(i+1)] for i in range(50)])
dict = {}
for i in a: dict[i] = [j for j in b if i in j]
There's no doubt that when a and b are large, this will be very slow.
Is there a more efficient way to do this?
Seeking your help!
Thanks for your idea; it completely solves my problem. Your core idea is to compare a and b and get a Boolean array as the result, so it is much faster to build the dictionary by indexing the array b with this Boolean array. Following this idea, I rewrote your code in my own way as:
dict = {}
for item in a:
    index_left, index_right = (b[:, 0] == item), (b[:, 1] == item)
    index = np.logical_or(index_left, index_right)
    dict[item] = b[index]
This code is still not faster than yours, but it avoids the memory error even for large a and b (e.g. len(a) = 100000 and len(b) = 200000).
Numpy arrays allow elementwise comparison:
equal = b[:,:,np.newaxis]==a #np.newaxis to broadcast
# if one of the two is equal, we will include this element
index = np.logical_or(equal[:,0], equal[:,1])
# indexing by a boolean array to get the result
dictionary = {i: b[index[:,i]] for i in range(len(a))}
As a final remark: are you sure you want to use a dictionary? With it you lose many of numpy's advantages.
Edit, answer to your comment:
With a and b this large, equal will have about 2*len(a)*len(b) = 4*10^10 boolean elements, i.e. on the order of 40 GB, which is why you get this error.
The main question you should ask is: do I really need arrays this big? If so, are you sure the dictionary will not be too large as well?
The problem can be solved by not computing everything at once but in n chunks, where n should be at least the ratio of that size to your available RAM; making n a little larger will probably speed up the process:
stride = int(np.ceil(len(b) / n))
dictionary = {j: [] for j in range(len(a))}
for i in range(n):
    # splitting b into several parts
    chunk = b[i*stride:(i+1)*stride]
    equal = chunk[:, :, np.newaxis] == a
    index = np.logical_or(equal[:, 0], equal[:, 1])
    for j in range(len(a)):
        dictionary[j].append(chunk[index[:, j]])
dictionary = {j: np.concatenate(v) for j, v in dictionary.items()}
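As a sketch of the earlier point about staying within numpy instead of building a dictionary (this is not part of the original answer; it reuses a and b from above), the matches can also be kept as plain index arrays:

import numpy as np

equal = b[:, :, np.newaxis] == a                 # shape (len(b), 2, len(a))
index = np.logical_or(equal[:, 0], equal[:, 1])  # shape (len(b), len(a))
rows, cols = np.nonzero(index)                   # b[rows[k]] contains a[cols[k]]
# e.g. all rows of b containing a[0]:
matches_for_a0 = b[rows[cols == 0]]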

Differences between X.ravel() and X.reshape(s0*s1*s2) when number of axes known

Seeing this answer, I am wondering whether these ways of creating a flattened view of X are essentially the same, as long as I know that the number of axes in X is 3:
A = X.ravel()
s0, s1, s2 = X.shape
B = X.reshape(s0*s1*s2)
C = X.reshape(-1)  # thanks to @hpaulj below
I'm not asking if A and B and C are the same.
I'm wondering if the particular use of ravel and reshape in this situation are essentially the same, or if there are significant differences, advantages, or disadvantages to one or the other, provided that you know the number of axes of X ahead of time.
The second method takes a few microseconds, but that does not seem to be size dependent.
Look at their __array_interface__ and do some timings. The only difference that I can see is that ravel is faster.
.flatten() has a more significant difference - it returns a copy.
A.reshape(-1)
is a simpler way to use reshape.
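For instance, a small check (not from the answer above) confirming that ravel and reshape(-1) return views of the same buffer here, while flatten copies:

import numpy as np

X = np.arange(24).reshape(2, 3, 4)
A = X.ravel()
C = X.reshape(-1)
F = X.flatten()

print(A.__array_interface__['data'][0] == X.__array_interface__['data'][0])  # True: same buffer
print(np.shares_memory(C, X))  # True: reshape(-1) returned a view here
print(np.shares_memory(F, X))  # False: flatten always copies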
You could study the respective docs, and see if there is something else. I haven't explored what happens when you specify order.
I would use ravel if I just want it to be 1d. I use .reshape most often to change a 1d (e.g. arange()) to nd.
e.g.
np.arange(10).reshape(2,5).ravel()
Or choose the one that makes your code most readable.
reshape and ravel are defined in numpy C code:
In https://github.com/numpy/numpy/blob/0703f55f4db7a87c5a9e02d5165309994b9b13fd/numpy/core/src/multiarray/shape.c
PyArray_Ravel(PyArrayObject *arr, NPY_ORDER order) requires nearly 100 lines of C code. And it punts to PyArray_Flatten if the order changes.
In the same file, reshape punts to newshape. That in turn returns a view if the shape doesn't actually change, tries _attempt_nocopy_reshape, and as a last resort returns a PyArray_NewCopy.
Both make use of PyArray_Newshape and PyArray_NewFromDescr - depending on how shapes and order mix and match.
So identifying where reshape (to 1d) and ravel are different would require careful study.
Another way to do this ravel is to make a new array with a new shape but the same data buffer:
np.ndarray((24,), buffer=A.data)
It times the same as reshape. Its __array_interface__ is the same. I don't recommend using this method, but it may clarify what is going on with these reshape/ravel functions. They all make a new array with a new shape, but with shared data (if possible). Timing differences are the result of different sequences of function calls - in Python and C - not of different handling of the data.

Incrementally and statistically measuring samples with Python (Numpy, Pandas, etc) and performance

Let's say I take a stream of incoming data (very fast) and I want to view various stats (say, standard deviation) over a window of the last N samples, with N being quite large. What's the most efficient way to do this with Python?
For example,
df=ps.DataFrame(np.random.random_sample(200000000))
df2 = df.append([5])
This crashes my REPL environment in Visual Studio.
Is there a way to append to an array without this happening? Is there a way to tell which operations on the dataframe are computed incrementally other than by doing timeit on them?
I recommend building a circular buffer out of a numpy array.
This will involve keeping track of an index to your last updated point, and incrementing that when you add a new value.
import numpy as np

circular_buffer = np.zeros(200000000, dtype=np.float64)
head = 0

# stream input
circular_buffer[head] = new_value
if head == len(circular_buffer) - 1:
    head = 0
else:
    head += 1
Then you can compute the statistics normally on circular_buffer.
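For instance (a small sketch; the wrapped flag is hypothetical, set the first time head rolls over, and is not part of the code above):

# before the first wrap-around only the filled part is valid
window = circular_buffer if wrapped else circular_buffer[:head]
window_mean = window.mean()
window_std = window.std()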
If You Don't Have Enough RAM
Try implementing something similar in bquery and bcolz. These store your data more efficiently than numpy (using compression) and offer similar performance. Bquery now has mean and standard deviation.
Note: I'm a contributor to bquery.

Matrices with different row lengths in numpy

Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim = 1. I understand why this happens, but I'm interested if there is any way to trick numpy into viewing m as 2D. One idea would be padding with a dummy value so that rows become equally sized, but I have lots of such matrices and it would take up too much space. The reason why I really need m to be 2D is that I am working with Theano, and the tensor which will be given the value of m expects a 2D value.
I'll give some very new information about Theano here. We have a new TypedList() type that allows a Python list whose elements all have the same type, for example 1D ndarrays. Everything is done except the documentation.
There is limited functionality you can use with them, but we made them to allow looping over a typed list with scan. That integration is not finished yet, but you can already use it like this:
import theano
import theano.typed_list

a = theano.typed_list.TypedListType(theano.tensor.fvector)()
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
                   non_sequences=[a],
                   sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.
No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a pointer, itemsize, stride triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
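To illustrate (a small sketch, not part of the original answers; recent numpy versions require an explicit dtype=object for ragged input), the ragged version is a 1D array of Python objects, while padding gives a true 2D array:

import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
print(ragged.ndim, ragged.dtype)   # 1 object

# padding with a dummy value gives a genuine 2D array
padded = np.array([[1, 2, 3], [4, 5, 0]])
print(padded.ndim, padded.shape)   # 2 (2, 3)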

Numpy: Reduce memory footprint of dot product with random data

I have a large numpy array that I am going to take a linear projection of using randomly generated values.
>>> input_array.shape
(50, 200000)
>>> random_array = np.random.normal(size=(200000, 300))
>>> output_array = np.dot(input_array, random_array)
Unfortunately, random_array takes up a lot of memory, and my machine starts swapping. It seems to me that I don't actually need all of random_array around at once; in theory, I ought to be able to generate it lazily during the dot product calculation...but I can't figure out how.
How can I reduce the memory footprint of the calculation of output_array from input_array?
This obviously isn't the fastest solution, but have you tried:
m, inner = input_array.shape
n = 300
out = np.empty((m, n))
for i in xrange(n):
    out[:, i] = np.dot(input_array, np.random.normal(size=inner))
This might be a situation where using cython could reduce your memory usage. You could generate the random numbers on the fly and accumulate the result as you go. I don't have the time to write and test the full function, but you would definitely want to use randomkit (the library that numpy uses under the hood) at the c-level.
You can take a look at some example code I wrote for another application to see how to wrap randomkit:
https://github.com/synapticarbors/pylangevin-integrator/blob/master/cIntegrator.pyx
And also check out how matrix multiplication is implemented in the following paper on cython:
http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf
Instead of having both arrays as inputs, just have input_array as one, and then in the method, generate small chunks of the random array as you go.
Sorry if it is just a sketch instead of actual code, but hopefully it is enough to get you started.
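A pure-numpy sketch of that chunking idea (not the cython approach described above; the function name, chunk size, and seed parameter are illustrative choices, not from the answer) accumulates the product a block of columns at a time, so only a small slice of the random matrix exists at once:

import numpy as np

def random_projection(input_array, n_components=300, chunk=50, seed=None):
    """Compute input_array.dot(R) for a random normal R, generating R in column chunks."""
    rng = np.random.RandomState(seed)
    m, inner = input_array.shape
    out = np.empty((m, n_components))
    for start in range(0, n_components, chunk):
        stop = min(start + chunk, n_components)
        # only (inner, stop - start) random values exist in memory at a time
        block = rng.normal(size=(inner, stop - start))
        out[:, start:stop] = np.dot(input_array, block)
    return out

output_array = random_projection(input_array, n_components=300)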