Matrices with different row lengths in numpy - numpy

Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim = 1. I understand why this happens, but I'm interested if there is any way to trick numpy into viewing m as 2D. One idea would be padding with a dummy value so that rows become equally sized, but I have lots of such matrices and it would take up too much space. The reason why I really need m to be 2D is that I am working with Theano, and the tensor which will be given the value of m expects a 2D value.

Here is some very new information about Theano. We have a new TypedList() type that allows a Python list whose elements all have the same type, such as 1-D ndarrays. Everything is done except the documentation.
Its functionality is limited, but we added it to allow looping over a typed list with scan. It is not yet integrated with scan, but you can already use it like this:
import theano
import theano.typed_list
a = theano.typed_list.TypedListType(theano.tensor.fvector)()
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
                   non_sequences=[a],
                   sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.

No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a pointer, itemsize, stride triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
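For reference, the padding idea mentioned in the question can be sketched as below. This is only an illustration under assumptions: the dummy value 0, the `lengths` vector, and the `mask` array are not from the original posts. Note also that recent NumPy versions refuse to build the ragged array at all unless dtype=object is given.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5]]  # ragged input from the question

# Pad every row to the longest length with a dummy value (0 here, an assumption).
lengths = np.array([len(r) for r in rows])
padded = np.zeros((len(rows), lengths.max()), dtype=np.int64)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r

# Boolean mask marking which entries are real data rather than padding.
mask = np.arange(padded.shape[1]) < lengths[:, None]

print(padded.ndim)    # a genuinely 2-D array
print(padded[mask])   # the original values, row by row, without the padding
```

Keeping `lengths` (or `mask`) alongside the padded array lets downstream code ignore the dummy entries.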

Related

Creating large ndarray from multiple mem-mapped arrays

I have multiple large images stored in binary (FITS) files on disk. Each array has the same shape and dtype.
I need to read in N of these images, but wish to preserve memory-mapping as they would swamp RAM. The easiest way to do this is, of course, to read them in as elements of a list. However, ideally I would like to treat this as a numpy array (of shape [n, ny, nx]), e.g. for easy transposing.
Is this possible, without reading these in to RAM?
Note: in practice, what I need is more complicated, equivalent to reading in list-of-list (e.g. an M element list, each element itself an N element list, each a ndarray image), but an answer to the simple case above should hopefully be sufficient.
Thanks for any help.
You can either create a complex abstraction that creates an array-like interface to multiple files, or you can consolidate your data. The former is going to be fairly complex, and probably not worth your time.
Consolidating the data, e.g. in a temporary file, is a much simpler option, which I've implemented here with the assumption that you are using astropy for your FITS I/O. You can tailor it for other libraries or other use-cases as you see fit.
import numpy as np
from tempfile import TemporaryFile
from astropy.io import fits
n = 0
with TemporaryFile() as output:
    for filename in my_list_of_files:
        with fits.open(filename) as hdus:
            # If you have a single HDU that you know how to reference, get rid of the loop
            for hdu in hdus:
                if isinstance(hdu, fits.ImageHDU):
                    data = hdu.data.T
                    if n == 0:
                        # Remember the first image's layout; skip any that differ
                        shape = data.shape
                        dtype = data.dtype
                    elif data.shape != shape or data.dtype != dtype:
                        continue
                    data.tofile(output)
                    n += 1
Now you have a single binary flat file with all your data in row-major order, and all the metadata you need to use numpy's memmap:
    array = np.memmap(output, dtype, shape=(n,) + shape)
Do all your work inside the outer with block, since output will be deleted on close in this implementation.
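The same consolidate-then-memmap pattern can be tried without any FITS files. The following is a minimal sketch that uses small synthetic arrays as stand-ins for the images; the names `images`, `stack`, and `plane_means` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
from tempfile import TemporaryFile

# Stand-ins for the per-file images (assumption: all share shape and dtype).
images = [np.full((4, 6), k, dtype=np.float32) for k in range(3)]

with TemporaryFile() as output:
    for img in images:
        img.tofile(output)   # append the raw bytes in C (row-major) order
    output.flush()

    # Map the consolidated file as one (n, ny, nx) array without loading it.
    stack = np.memmap(output, dtype=np.float32, mode="r",
                      shape=(len(images),) + images[0].shape)
    plane_means = np.asarray(stack.mean(axis=(1, 2)))
    transposed_shape = stack.transpose(0, 2, 1).shape
    del stack                # drop the mapping before the file is closed

print(plane_means, transposed_shape)
```

All array work happens inside the with block, mirroring the advice above, because the temporary file disappears when it is closed.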

How to retain indices of a matrix while working on one of its submatrices?

I am trying to implement an algorithm that iteratively removes some rows and columns of a matrix and continues processing the remaining submatrix. However, I would like to know the index of a value in the original matrix rather than the remaining submatrix.
For example, assume that a matrix x is built using
x = np.arange(9).reshape(3, 3)
Now, I would like to find the index of the element that is equal to 8 in the submatrix defined below:
np.where(x[1:, 1:] == 8)
By default, numpy returns (array([1]), array([1])) because it is finding the element in the sliced submatrix. What I would like to be returned instead is (array([2]), array([2])), which is the index of 8 in the original matrix.
What is an efficient solution to this problem?
P.S.
The submatrix may be built arbitrarily. For example, I may need to keep rows 0 and 1, but columns 0 and 2.
Each submatrix may be sliced in next iterations to make a smaller submatrix. I still would like to have access to the index in the original matrix. In other words, I am looking for a solution that works on submatrices of submatrices as well.
I recently learned about indexing with arrays where submatrices of a matrix can be selected using another numpy array. I think what I can do to solve the problem is to map indices of the submatrix to elements of the indexing array.
For example, in the example above, the submatrix can be defined like this:
row_idx = np.array([1, 2])
col_idx = np.array([1, 2])
np.where(x[row_idx[:, None], col_idx] == 8)
This will still return the same (array([1]), array([1])) output, but I can use these indices to look up the elements of row_idx and col_idx in order to find the corresponding indices in the original matrix, i.e. row_idx[1] and col_idx[1].
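A small sketch of that lookup, including a submatrix of a submatrix, might look like this. The second-level names `row_idx2` and `col_idx2` are hypothetical and only serve the illustration.

```python
import numpy as np

x = np.arange(9).reshape(3, 3)
row_idx = np.array([1, 2])
col_idx = np.array([1, 2])

# Find 8 in the submatrix, then translate back to original coordinates.
sub = x[row_idx[:, None], col_idx]
r, c = np.where(sub == 8)
orig = (row_idx[r], col_idx[c])

# Slicing again: index the index arrays themselves, so they always
# hold positions in the original matrix.
row_idx2 = row_idx[[1]]       # keep only the second row of sub
col_idx2 = col_idx[[0, 1]]    # keep both columns of sub
sub2 = x[row_idx2[:, None], col_idx2]
r2, c2 = np.where(sub2 == 7)
orig2 = (row_idx2[r2], col_idx2[c2])

print(orig, orig2)
```

Because each level of slicing only subsets row_idx and col_idx, no chain of offsets has to be tracked.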

Using vectorize to apply function to each row in Numpy 2d array

I have a 1000x784 matrix of data (10000 examples and 784 features) called X_valid and I'd like to apply the following function to each row in this matrix and get the numerical result:
def predict_prob(x_valid, cov, mean, prior):
    return -0.5 * (x_valid.T.dot(np.linalg.inv(cov)).dot(x_valid) + mean.T.dot(
        np.linalg.inv(cov)).dot(mean) + np.linalg.slogdet(cov)[1]) + np.log(
        prior)
(x_valid is simply a row of data). I'm using numpy's vectorize to do this with the following code:
v_predict_prob = np.vectorize(predict_prob)
scores = v_predict_prob(X_valid, covariance[num], means[num], priors[num])
(covariance[num], means[num], and priors[num] are just constants.)
However, I get the following error when running this:
File "problem_5.py", line 48, in predict_prob
return -0.5 * (x_valid.T.dot(np.linalg.inv(cov)).dot(x_valid) + mean.T.dot(np.linalg.inv(cov)).dot(mean) + np.linalg.slogdet(cov)[1]) + np.log(prior)
AttributeError: 'numpy.float64' object has no attribute 'dot'
That is, it's not passing in each row of the matrix individually. Instead, it is passing in each entry of the matrix (not what I want).
How can I alter this to get the desired behavior?
vectorize is NOT a general substitute for iteration, nor does it claim to be faster. It mainly streamlines access to the numpy broadcasting functionality. In general the function that you vectorize will take scalar inputs, not rows or 1d arrays.
I don't think there is a way of configuring vectorize to pass an array to your function as opposed to an item.
You describe x_valid as 2d that you want to evaluate row by row. And the other terms as 'constants' which you select with [num]. What shape are those constants?
Your function treats a lot of these terms as 2d arrays:
x_valid.T.dot(np.linalg.inv(cov)).dot(x_valid) +
mean.T.dot(np.linalg.inv(cov)).dot(mean) +
np.linalg.slogdet(cov)[1]) + np.log(prior)
x_valid.T is meaningful only if x_valid is 2d. If it is 1d, the transpose does nothing.
np.linalg.inv(cov) only makes sense if cov is 2d.
mean.T.dot... assumes mean is 2d.
np.linalg.slogdet(cov)[1] picks the second item of the (sign, logdet) pair that np.linalg.slogdet returns, so cov again has to be a square 2d array.
You need to show us that the function works with some real arrays before jumping into iteration or 'vectorize'.
I suggest just using a for loop:
def v_predict_prob(X_valid, c, m, p):
    out = []
    for row in X_valid:
        out.append(predict_prob(row, c, m, p))
    return np.array(out)
Under the hood np.vectorize is doing the same thing: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html
I know this question is a bit outdated, but I thought I would provide an answer for 2020.
Since the release of numpy 1.12, there is a new optional argument, "signature", which should allow 2D array functionality in most cases. Additionally, you will want to list the constants in "excluded" since they will not be vectorized.
All you would need to change is:
v_predict_prob = np.vectorize(predict_prob, excluded=['cov', 'mean', 'prior'], signature='(n)->()')
This signifies that the function should expect a 1-D array of length n and output a scalar, and that cov, mean, and prior will not be vectorized. Because they are excluded by name, pass them as keyword arguments when calling v_predict_prob.
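To check that the signature approach really hands whole rows to the function, here is a small self-contained sketch. The data are synthetic (a 5x3 matrix, a diagonal covariance, and an arbitrary prior are all assumptions), and predict_prob is written with the @ operator for 1-D inputs rather than copied verbatim from the question.

```python
import numpy as np

def predict_prob(x_valid, cov, mean, prior):
    # x_valid and mean are 1-D vectors; cov is a square 2-D matrix.
    cov_inv = np.linalg.inv(cov)
    return -0.5 * (x_valid @ cov_inv @ x_valid
                   + mean @ cov_inv @ mean
                   + np.linalg.slogdet(cov)[1]) + np.log(prior)

rng = np.random.default_rng(0)
X_valid = rng.normal(size=(5, 3))      # 5 examples, 3 features (made up)
cov = np.eye(3) * 2.0
mean = np.array([0.5, -1.0, 0.0])
prior = 0.25

v_predict_prob = np.vectorize(predict_prob,
                              excluded=['cov', 'mean', 'prior'],
                              signature='(n)->()')

# The excluded arguments are matched by name, so pass them as keywords.
scores = v_predict_prob(X_valid, cov=cov, mean=mean, prior=prior)

# Reference result from the plain for loop suggested earlier.
loop_scores = np.array([predict_prob(row, cov, mean, prior) for row in X_valid])
print(scores.shape, np.allclose(scores, loop_scores))
```

This still loops in Python under the hood, so it is a convenience rather than a speed-up.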

Differences between X.ravel() and X.reshape(s0*s1*s2) when number of axes known

Seeing this answer, I am wondering whether these ways of creating a flattened view of X are essentially the same, as long as I know that X has 3 axes:
A = X.ravel()
s0, s1, s2 = X.shape
B = X.reshape(s0*s1*s2)
C = X.reshape(-1) # thanks to @hpaulj below
I'm not asking if A and B and C are the same.
I'm wondering if the particular use of ravel and reshape in this situation are essentially the same, or if there are significant differences, advantages, or disadvantages to one or the other, provided that you know the number of axes of X ahead of time.
The second method takes a few microseconds, but that does not seem to be size dependent.
Look at their __array_interface__ and do some timings. The only difference that I can see is that ravel is faster.
.flatten() has a more significant difference - it returns a copy.
A.reshape(-1)
is a simpler way to use reshape.
You could study the respective docs, and see if there is something else. I haven't explored what happens when you specify order.
I would use ravel if I just want it to be 1d. I use .reshape most often to change a 1d (e.g. arange()) to nd.
e.g.
np.arange(10).reshape(2,5).ravel()
Or choose the one that makes your code most readable.
reshape and ravel are defined in numpy C code:
In https://github.com/numpy/numpy/blob/0703f55f4db7a87c5a9e02d5165309994b9b13fd/numpy/core/src/multiarray/shape.c
PyArray_Ravel(PyArrayObject *arr, NPY_ORDER order) requires nearly 100 lines of C code. And it punts to PyArray_Flatten if the order changes.
In the same file, reshape punts to newshape. That in turn returns a view if the shape doesn't actually change, tries _attempt_nocopy_reshape, and as a last resort returns a PyArray_NewCopy.
Both make use of PyArray_Newshape and PyArray_NewFromDescr - depending on how shapes and order mix and match.
So identifying where reshape (to 1d) and ravel are different would require careful study.
Another way to do this ravel is to make a new array, with a new shape, but the same data buffer:
np.ndarray((24,),buffer=A.data)
It times the same as reshape. Its __array_interface__ is the same. I don't recommend using this method, but it may clarify what is going on with these reshape/ravel functions. They all make a new array, with a new shape, but with shared data (if possible). Timing differences are the result of different sequences of function calls - in Python and C - not of different handling of the data.
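The view-versus-copy behaviour described above can be checked directly with np.shares_memory. This is just an illustrative sketch; the 2x3x4 shape and the variable names beyond A, B, and C are assumptions.

```python
import numpy as np

X = np.arange(24).reshape(2, 3, 4)   # C-contiguous 3-D array

A = X.ravel()
s0, s1, s2 = X.shape
B = X.reshape(s0 * s1 * s2)
C = X.reshape(-1)
F = X.flatten()

# ravel and both reshape forms reuse X's buffer; flatten always copies.
views_share = [np.shares_memory(X, v) for v in (A, B, C)]
copy_shares = np.shares_memory(X, F)

# For a transposed (non-contiguous) array a flat C-order view is impossible,
# so ravel and reshape both fall back to a copy.
Xt = X.transpose(2, 1, 0)
ravel_is_copy = not np.shares_memory(Xt, Xt.ravel())
reshape_is_copy = not np.shares_memory(Xt, Xt.reshape(-1))

print(views_share, copy_shares, ravel_is_copy, reshape_is_copy)
```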

Numpy sum over planes of 3d array, return a scalar

I'm making the transition from MATLAB to Numpy and feeling some growing pains.
I have a 3D array, let's say it's 3x3x3, and I want the scalar sum of each plane.
In matlab, I would use:
sum_vec = sum(3dArray,3);
TIA
wbg
EDIT: I was wrong about my matlab code. Matlab only vectorizes in one dim, so a loop would be required. So numpy turns out to be more elegant...cool.
MATLAB
for i = 1:3
    sum_vec(i) = sum(sum(3dArray(:,:,i)));
end
You can do
sum_vec = np.array([plane.sum() for plane in cube])
or simply
sum_vec = cube.sum(-1).sum(-1)
where cube is your 3d array. You can specify 0 or 1 instead of -1 (or 2) depending on the orientation of the planes. The latter version is also better because it doesn't use a Python loop, which usually helps to improve performance when using numpy.
You should use the axis keyword in np.sum. Like in many other numpy functions, axis lets you perform the operation along a specific axis. For example, if you want to sum along the last dimension of the array, you would do:
import numpy as np
sum_vec = np.sum(arr3d, axis=-1)
Here arr3d is your 3D array (3dArray itself is not a valid Python name). You'll get a 2D array holding, for every pair (i, k), the sum over the last dimension of the slice arr3d[i, k, :].
UPDATE
I didn't understand exactly what you wanted. You want to sum over two dimensions (a plane). In this case you can do two sums. For example, summing over the first two dimensions:
sum_vec = np.sum(np.sum(arr3d, axis=0), axis=0)
Instead of applying the same sum function twice, you may perform the sum on the reshaped array:
a = np.random.rand(10, 10, 10) # 3D array
b = a.view()
b.shape = (a.shape[0], -1)
c = np.sum(b, axis=1)
The above should be faster because you only sum once.
sumvec = np.sum(arr3d, axis=2)
or this works as well
sumvec = arr3d.sum(2)
Here arr3d stands for the 3D array. Remember that Python indexing starts at 0, so axis=2 represents the 3rd dimension.
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.sum.html
If you're trying to sum over a plane (and avoid Python loops, which is usually a good idea) you can use np.sum and pass two axes as a tuple for the axis argument.
For example, if you have an (n x 3 x 3) array, then
np.sum(a, (1,2))
gives an array of shape (n,), summing over each plane rather than a single axis. Pass keepdims=True if you want shape (n x 1 x 1) instead.
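The approaches from this thread can be compared side by side on a small cube. This is a sketch on made-up data (a 3x3x3 arange array); the variable names are illustrative only.

```python
import numpy as np

cube = np.arange(27).reshape(3, 3, 3)

# One scalar per plane cube[i, :, :], computed four equivalent ways.
by_tuple = cube.sum(axis=(1, 2))
by_loop = np.array([plane.sum() for plane in cube])
by_twice = cube.sum(-1).sum(-1)
by_reshape = cube.reshape(cube.shape[0], -1).sum(axis=1)

# One scalar per plane cube[:, :, i], matching the MATLAB loop over (:,:,i).
last_axis_planes = cube.sum(axis=(0, 1))

# keepdims=True retains the summed axes with length 1.
kept = cube.sum(axis=(1, 2), keepdims=True)

print(by_tuple, last_axis_planes, kept.shape)
```

Which axes to put in the tuple depends on how the planes are oriented in your array.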