How to construct a matrix based on an array in numpy? - numpy

I am trying to do a function iteratively to an array, and make a matrix composed of what it returns. If this was native python, what I would do is:
[func(x, y) for y in xrange(Y)]
but if I do that, I need to wrap it with numpy.matrix() to vectorize it. What is the numpy way of doing this? Right now I am initializing a zeros matrix and then populating it with elements I get from a for loop, but that seems inefficient.

Take a look at the numpy tutorial, especially the part about Universal Functions or ufuncs. A ufunc is:
Functions that operate element by element on whole arrays.
which sounds like what you're asking for. Keep in mind that you probably don't need to write your own ufunc, but just write func in terms of existing ufuncs. For example:
def hypot(a, b):
return np.sqrt(a**2 + b**2)
>>> a = np.array([3., 5., 10.])
>>> b = np.array([4., 12., 24.,])
>>> hypot(a, b)
array([ 5., 13., 26.])

Related

How to vectorize this operation in numpy?

I have a 2d array s and I want to calculate differences elementwise, i.e.:
Since it cannot be written as a single matrix multiplication, I was wondering what is the proper way to vectorize it?
You can use broadcasting for that: d = s[:, None, :] - s[None, :, :]. Note the None enable you to create a new dimension. Numpy implicitly perform the broadcasting operation between the two arrays.

Efficient axis-wise cartesian product of multiple 2D matrices with Numpy or TensorFlow

So first off, I think what I'm trying to achieve is some sort of Cartesian product but elementwise, across the columns only.
What I'm trying to do is, if you have multiple 2D arrays of size [ (N,D1), (N,D2), (N,D3)...(N,Dn) ]
The result is thus to be a combinatorial product across axis=1 such that the final result will then be of shape (N, D) where D=D1*D2*D3*...Dn
e.g.
A = np.array([[1,2],
[3,4]])
B = np.array([[10,20,30],
[5,6,7]])
cartesian_product( [A,B], axis=1 )
>> np.array([[ 1*10, 1*20, 1*30, 2*10, 2*20, 2*30 ]
[ 3*5, 3*6, 3*7, 4*5, 4*6, 4*7 ]])
and extendable to cartesian_product([A,B,C,D...], axis=1)
e.g.
A = np.array([[1,2],
[3,4]])
B = np.array([[10,20],
[5,6]])
C = np.array([[50, 0],
[60, 8]])
cartesian_product( [A,B,C], axis=1 )
>> np.array([[ 1*10*50, 1*10*0, 1*20*50, 1*20*0, 2*10*50, 2*10*0, 2*20*50, 2*20*0]
[ 3*5*60, 3*5*8, 3*6*60, 3*6*8, 4*5*60, 4*5*8, 4*6*60, 4*6*8]])
I have a working solution that essentially creates an empty (N,D) matrix and then broadcasting a vector columnwise product for each column within nested for loops for each matrix in the provided list. Clearly is horrible once the arrays get larger!
Is there an existing solution within numpy or tensorflow for this? Potentially one that is efficiently paralleizable (A tensorflow solution would be wonderful but a numpy is ok and as long as the vector logic is clear then it shouldn't be hard to make a tf equivalent)
I'm not sure if I need to use einsum, tensordot, meshgrid or some combination thereof to achieve this. I have a solution but only for single-dimension vectors from https://stackoverflow.com/a/11146645/2123721 even though that solution says to work for arbitrary dimensions array (which appears to mean vectors). With that one i can do a .prod(axis=1), but again this is only valid for vectors.
thanks!
Here's one approach to do this iteratively in an accumulating manner making use of broadcasting after extending dimensions for each pair from the list of arrays for elmentwise multiplications -
L = [A,B,C] # list of arrays
n = L[0].shape[0]
out = (L[1][:,None]*L[0][:,:,None]).reshape(n,-1)
for i in L[2:]:
out = (i[:,None]*out[:,:,None]).reshape(n,-1)

Storing multidimensional variable length array with h5py

I'm trying to store a list of variable length arrays in an HDF file with the following procedure:
phn_mfccs = []
# Import wav files
for waveform in files:
phn_mfcc = mfcc(waveform) # produces a variable length multidim array of the shape (x, 13, 1)
# Add MFCC and label to dataset
# phn_mfccs has dimension (len(files),)
# phn_mfccs[i] has variable dimension ([# of frames in ith segment] (variable), 13, 1)
phn_mfccs.append(phn_mfcc)
dt = h5py.special_dtype(vlen=np.dtype('float64'))
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)
It seems like my datatypes aren't working out though -- instead of each element of the mfccs_out dataset containing a multidimensional array, it contains just a 1D array. e.g. if the first phn_mfcc I append originally has dimension (59,13,1), mfccs_out['phn_mfccs'][0] has dimension (59,).
I suspect it is because I'm just using a float64 datatype, and I need something else for an array of arrays? If I don't specify the dataset or try to use dtype='O', though, it spits out an error like "Object dtype 'O' has no native HDF equivalent."
Ideally, what I'd like is for mfccs_out['phn_mfccs'][i] to contain the ith phn_mfcc that I appended to the list phn_mfccs.
The essence of your code is:
phn_mfccs = []
<loop several layers>
phn_mfcc = <some sort of array expanded by one dimension>
phn_mfccs.append(phn_mfcc)
At the end of loops phn_mfccs is a list of arrays. I can't tell from the code what the dtype and shape is. Or whether it differs for each element of the list.
I'm not entirely sure what create_dataset does when given a list of arrays. It may wrap it in np.array.
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)
What does np.array(phn_mfccs) produce? Shape, dtype? If all the elements are arrays of the same shape and dtype it will produce a higher dimensional array. If they differ in shape, it will produce a 1d array with object dtype. Given the error message, I suspect the latter.
I've answered a few vlen questions but haven't worked with it a lot
http://docs.h5py.org/en/latest/special.html
I vaguely recall that the 'ragged' dimension of a h5 array can only be 1d. So a phn_mfccs object array that contains 1d float arrays of varying dimensions might work.
I might come up with a simple example. And I suggest you construct a simpler problem that we can copy-n-paste and experiement with. We don't need to know how you read the data from your directory. We just need to understand the content of the array (list) that you are trying to write.
A 2015 post on vlen arrays
Inexplicable behavior when using vlen with h5py
H5PY - How to store many 2D arrays of different dimensions
1d ragged arrays example
In [24]: f = h5py.File('vlen.h5','w')
In [25]: dt = h5py.special_dtype(vlen=np.dtype('float64'))
In [26]: dataset = f.create_dataset('vlen',(4,), dtype=dt)
In [27]: dataset.value
Out[27]:
array([array([], dtype=float64), array([], dtype=float64),
array([], dtype=float64), array([], dtype=float64)], dtype=object)
In [28]: for i in range(4):
...: dataset[i]=np.arange(i+3)
In [29]: dataset.value
Out[29]:
array([array([ 0., 1., 2.]), array([ 0., 1., 2., 3.]),
array([ 0., 1., 2., 3., 4.]),
array([ 0., 1., 2., 3., 4., 5.])], dtype=object)
If I try to write 2d arrays to dataset I get an error
OSError: Can't prepare for writing data (Src and dest data spaces have different sizes)
The dataset itself may be multidimensional, but the vlen object has to be a 1d array of floats.

Evaluate several elements of numpy object array

I have an ndarray A that stores objects of the same type, in particular various LinearNDInterpolator objects. For example's sake assume it's just 2:
>>> A
array([ <scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe122adc750>,
<scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe11daee590>], dtype=object)
I want to be able to do two things. First, I'd like to evaluate all objects in A at a certain point and get back an ndarray of A.shape with all the values in it. Something like
>> A[[0,1]](1,1) =
array([ 1, 2])
However, I get
TypeError: 'numpy.ndarray' object is not callable
Is it possible to do that?
Second, I would like to change the interpolation values without constructing new LinearNDInterpolator objects (since the nodes stay the same). I.e., something like
A[[0,1]].values = B
where B is an ndarray containing the new values for every element of A.
Thank you for your suggestions.
The same issue, but with simpler functions:
In [221]: A=np.array([add,multiply])
In [222]: A[0](1,2) # individual elements can be called
Out[222]: 3
In [223]: A(1,2) # but not the array as a whole
---------------------------------------------------------------------------
TypeError: 'numpy.ndarray' object is not callable
We can iterate over a list of functions, or that array as well, calling each element on the parameters. Done right we can even zip a list of functions and a list of parameters.
In [224]: ll=[add,multiply]
In [225]: [x(1,2) for x in ll]
Out[225]: [3, 2]
In [226]: [x(1,2) for x in A]
Out[226]: [3, 2]
Another test, the callable function:
In [229]: callable(A)
Out[229]: False
In [230]: callable(A[0])
Out[230]: True
Can you change the interpolation values for individual Interpolators? If so, just iterate through the list and do that.
In general, dtype object arrays function like lists. They contain the same kind of object pointers. Most operations requires the same sort of iteration. Unless you need to organize the elements in multiple dimensions, dtype object arrays have few, if any advantages over lists.
Another thought - the normal array dtype is numeric or fixed length strings. These elements are not callable, so there's no need to implement a .__call__ method on these arrays. They could write something like that to operate on object dtype arrays, but the core action is a Python call. So such a function would just hide the kind of iteration that I outlined.
In another recent question I showed how to use np.char.upper to apply a string method to every element of a S dtype array. But my time tests showed that this did not speedup anything.

einsum on a sparse matrix

It seems numpy's einsum function does not work with scipy.sparse matrices. Are there alternatives to do the sorts of things einsum can do with sparse matrices?
In response to #eickenberg's answer: The particular einsum I'm wanting to is numpy.einsum("ki,kj->ij",A,A) - the sum of the outer products of the rows.
A restriction of scipy.sparse matrices is that they represent linear operators and are thus kept two dimensional, which leads to the question: Which operation are you seeking to do?
All einsum operations on a pair of 2D matrices are very easy to write without einsum using dot, transpose and pointwise operations, provided that the result does not exceed two dimensions.
So if you need a specific operation on a number of sparse matrices, it is probable that you can write it without einsum.
UPDATE: A specific way to implement np.einsum("ki, kj -> ij", A, A) is A.T.dot(A). In order to convince yourself, please try the following example:
import numpy as np
rng = np.random.RandomState(42)
a = rng.randn(3, 3)
b = rng.randn(3, 3)
the_einsum_ab = np.einsum("ki, kj -> ij", a, b)
the_a_transpose_times_b = a.T.dot(b)
# We write a test in order to assert equality
from numpy.testing import assert_array_equal
assert_array_equal(the_einsum_ab, the_a_transpose_times_b) # This passes, so equality
This result is slightly more general. Now if you use b = a you obtain your specific result.
einsum translates the index string into a calculation using the C version of np.nditer. http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html is a nice introduction to nditer. Note especially the Cython example at the end.
https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py is a Python simulation of the einsum.
scipy.sparse has its own code (ultimately in C) to perform the basic operations, summation, matrix multiplication, etc. Sparse matricies have their own data structures. They can be lists, dictionaries, or a set of numpy arrays. Numpy notation can be used because sparse has the appropriate __xxx__ methods.
A sparse matrix is a matrix, a 2d array object. A sparse einsum could be written, but it would end up using the sparse matrix multiplication, not nditer. So at best it would be a notational convenience.
Sparse csr_matrix.dot is:
def dot(self, other):
"""Ordinary dot product
...
"""
return self * other
A=sparse.csr_matrix([[1,2],[3,4]])
A.dot(A.T).A
(A*A.T).A
A.__rmul__(A.T).A
A.__mul__(A.T).A
np.einsum('ij,kj',A.A,A.A)
# array([[ 5, 11],
# [11, 25]])