In numpy, there is a flatten operation which allows you to, for example, flatten a m x n matrix down to an array of mn elements, and a reshape operations which goes in the opposite direction. Much of the time this can be done with a view, without creating a copy of the original data.
Does such a capability exist in Parallel Colt, the Java matrix library? I have not been able to find one. There is a reshape method on one-dimensional matrices, but it appears to create copies.
Related
I am using sparse matrices in python, namely
scipy.sparse.csr_matrix
I am in principle free to choose the exact sparse implementation, as long as the matrices support matrix-vector multiplication and addition/subtraction of matrices with the same sparsity pattern. Currently, at every time step, I construct a new sparse matrix from scratch and add it to the existing matrix. I believe that my code could be unnecessarily losing time due to
Construction time of sparse matrix
Addition of the sparse matrices, assuming that the underlying algorithm inside CSR matrix implementation has to find matching sparse entries before adding them up.
My guess would be that the sparse matrix is internally stored as a numpy array of values + a few index arrays denoting where those values are located. The question is if it is possible to directly add the underlying value arrays without touching the sparsity structure. Is something like this possible?
new_values = np.linspace(0, num_values)
csr_mat.val += new_values
Quite simply, what I want to do is the following
A = np.ones((3,3)) #arbitrary matrix
B = np.ones((2,2)) #arbitrary matrix
A[1:,1:] = A[1:,1:] + B
except in Tensorflow (where the matrices can be arbitrarily complicated tensor expressions). Neither A nor B is a Tensorflow Variable, but just a run-of-the-mill tensor.
What I have gathered so far: tensors are immutable, so I cannot assign to a submatrix. tf.scatter_nd is the current option for sub-assignment, but does not appear to support sub-matrices, only slices.
Methods that should work, but are perhaps not ideal:
I could pad B with zeros, but I'm sure this leads to instantiation of
an unnecessarily large B - can it be made sparse, maybe?
I could use the padding idea, but write it as a low-rank decomposition, e.g. in Numpy: A+U.dot(B).U.T where U is a stacked zero and identity matrix. I'm not sure this is actually advantageous.
I could split A into submatrices, and stack them back together. Might be the most efficient, but sounds like the code would be convoluted.
Ideally, I want to do this operation N times for progressively smaller matrices, resulting in one large final result, but this is tangential.
I'll use one of the hacks for now, but I'm hoping someone can tell me what the idiomatic version is!
I have to operate on matrices using an equivalent of sicpy's sparse.coo_matrix and sparse.csr_matrix. However, I cannot use scipy (it is incompatible with the image analysis software I want to use this in). I can, however, use numpy.
Is there an easy way to accomplish what scipy.sparse.coo_matrix and scipy.sparse.csr_matrix do, with numpy only?
Thanks!
The attributes of a sparse.coo_matrix are:
dtype : dtype
Data type of the matrix
shape : 2-tuple
Shape of the matrix
ndim : int
Number of dimensions (this is always 2)
nnz
Number of nonzero elements
data
COO format data array of the matrix
row
COO format row index array of the matrix
col
COO format column index array of the matrix
The data, row, col arrays are essentially the data, i, j parameters when defined with coo_matrix((data, (i, j)), [shape=(M, N)]). shape also comes from the definition. dtype from the data array. nzz as first approximation is the length of data (not accounting for zeros and duplicate coordinates).
So it is easy to construct a coo like object. Similarly a lil matrix has 2 lists of lists. And a dok matrix is a dictionary (see its .__class__.__mro__).
The data structure of a csr matrix is a bit more obscure:
data
CSR format data array of the matrix
indices
CSR format index array of the matrix
indptr
CSR format index pointer array of the matrix
It still has 3 arrays. And they can be derived from the coo arrays. But doing so with pure Python code won't be nearly as fast as the compiled scipy functions.
But these classes have a lot of functionality that would require a lot of work to duplicate. Some is pure Python, but critical pieces are compiled for speed. Particularly important are the mathematical operations that the csr_matrix implements, such as matrix multiplication.
Replicating the data structures for temporary storage is one thing; replicating the functionality is quite another.
Is there a way of defining a matrix (say m) in numpy with rows of different lengths, but such that m stays 2-dimensional (i.e. m.ndim = 2)?
For example, if you define m = numpy.array([[1,2,3], [4,5]]), then m.ndim = 1. I understand why this happens, but I'm interested if there is any way to trick numpy into viewing m as 2D. One idea would be padding with a dummy value so that rows become equally sized, but I have lots of such matrices and it would take up too much space. The reason why I really need m to be 2D is that I am working with Theano, and the tensor which will be given the value of m expects a 2D value.
I'll give here very new information about Theano. We have a new TypedList() type, that allow to have python list with all elements with the same type: like 1d ndarray. All is done, except the documentation.
There is limited functionality you can do with them. But we did it to allow looping over the typed list with scan. It is not yet integrated with scan, but you can use it now like this:
import theano
import theano.typed_list
a = theano.typed_list.TypedListType(theano.tensor.fvector)()
s, _ = theano.scan(fn=lambda i, tl: tl[i].sum(),
non_sequences=[a],
sequences=[theano.tensor.arange(2, dtype='int64')])
f = theano.function([a], s)
f([[1, 2, 3], [4, 5]])
One limitation is that the output of scan must be an ndarray, not a typed list.
No, this is not possible. NumPy arrays need to be rectangular in every pair of dimensions. This is due to the way they map onto memory buffers, as a pointer, itemsize, stride triple.
As for this taking up space: np.array([[1,2,3], [4,5]]) actually takes up more space than a 2×3 array, because it's an array of two pointers to Python lists (and even if the elements were converted to arrays, the memory layout would still be inefficient).
I've been unable to figure out how to access, add, multiply, replace, etc. single columns of a NumPy matrix. I can do this via looping over individual elements of the column, but I'd like to treat the column as a unit, something that I can do with rows.
When I've tried to search I'm usually directed to answers handling NumPy arrays, but this is not the same thing.
Can you provide code that's giving trouble? The operations on columns that you list are among the most basic operations that are supported and optimized in NumPy. Consider looking over the tutorial on NumPy for MATLAB users, which has many examples of accessing rows or columns, performing vectorized operations on them, and modifying them with copies or in-place.
NumPy for MATLAB Users
Just to clarify, suppose you have a 2-dimensional NumPy ndarray or matrix called a. Then a[:, 0] would access the first column just the same as a[0] or a[0, :] would access the first row. Any operations that work for rows should work for columns as well, with some caveats for broadcasting rules and certain mathematical operations that depend upon array alignment. You can also use the numpy.transpose(a) function (which is also exposed with a.T) to transpose a making the columns become rows.