Storing multidimensional variable length array with h5py - numpy

I'm trying to store a list of variable length arrays in an HDF file with the following procedure:
phn_mfccs = []
# Import wav files
for waveform in files:
    phn_mfcc = mfcc(waveform) # produces a variable length multidim array of the shape (x, 13, 1)
    # Add MFCC and label to dataset
    # phn_mfccs has dimension (len(files),)
    # phn_mfccs[i] has variable dimension ([# of frames in ith segment] (variable), 13, 1)
    phn_mfccs.append(phn_mfcc)
dt = h5py.special_dtype(vlen=np.dtype('float64'))
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)
It seems like my datatypes aren't working out though -- instead of each element of the mfccs_out dataset containing a multidimensional array, it contains just a 1D array. e.g. if the first phn_mfcc I append originally has dimension (59,13,1), mfccs_out['phn_mfccs'][0] has dimension (59,).
I suspect it is because I'm just using a float64 datatype, and I need something else for an array of arrays? If I don't specify the dataset or try to use dtype='O', though, it spits out an error like "Object dtype 'O' has no native HDF equivalent."
Ideally, what I'd like is for mfccs_out['phn_mfccs'][i] to contain the ith phn_mfcc that I appended to the list phn_mfccs.

The essence of your code is:
phn_mfccs = []
<loop several layers>
phn_mfcc = <some sort of array expanded by one dimension>
phn_mfccs.append(phn_mfcc)
At the end of loops phn_mfccs is a list of arrays. I can't tell from the code what the dtype and shape is. Or whether it differs for each element of the list.
I'm not entirely sure what create_dataset does when given a list of arrays. It may wrap it in np.array.
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)
What does np.array(phn_mfccs) produce? Shape, dtype? If all the elements are arrays of the same shape and dtype it will produce a higher dimensional array. If they differ in shape, it will produce a 1d array with object dtype. Given the error message, I suspect the latter.
I've answered a few vlen questions but haven't worked with vlen a lot. The relevant h5py docs:
http://docs.h5py.org/en/latest/special.html
I vaguely recall that the 'ragged' dimension of a h5 array can only be 1d. So a phn_mfccs object array that contains 1d float arrays of varying dimensions might work.
I might come up with a simple example. I also suggest you construct a simpler problem that we can copy-and-paste and experiment with. We don't need to know how you read the data from your directory. We just need to understand the content of the array (list) that you are trying to write.
A 2015 post on vlen arrays
Inexplicable behavior when using vlen with h5py
H5PY - How to store many 2D arrays of different dimensions
1d ragged arrays example
In [24]: f = h5py.File('vlen.h5','w')
In [25]: dt = h5py.special_dtype(vlen=np.dtype('float64'))
In [26]: dataset = f.create_dataset('vlen',(4,), dtype=dt)
In [27]: dataset.value
Out[27]:
array([array([], dtype=float64), array([], dtype=float64),
       array([], dtype=float64), array([], dtype=float64)], dtype=object)
In [28]: for i in range(4):
    ...:     dataset[i]=np.arange(i+3)
In [29]: dataset.value
Out[29]:
array([array([ 0., 1., 2.]), array([ 0., 1., 2., 3.]),
       array([ 0., 1., 2., 3., 4.]),
       array([ 0., 1., 2., 3., 4., 5.])], dtype=object)
If I try to write 2d arrays to the dataset I get an error:
OSError: Can't prepare for writing data (Src and dest data spaces have different sizes)
The dataset itself may be multidimensional, but the vlen object has to be a 1d array of floats.
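Since the vlen element has to be 1d, one possible workaround (a sketch, not something I have used in production) is to flatten each array before writing and restore the fixed trailing dimensions on read. The random arrays below are stand-ins for the MFCC output, the file name is made up, and the trailing shape (13, 1) is taken from the question. The sketch indexes the dataset directly instead of using dataset.value, which recent h5py versions no longer provide.

```python
import numpy as np
import h5py

rng = np.random.default_rng(0)
# Stand-ins for the MFCC arrays: shape (n_frames, 13, 1) with varying n_frames.
phn_mfccs = [rng.standard_normal((n, 13, 1)) for n in (59, 42, 7)]

dt = h5py.special_dtype(vlen=np.dtype('float64'))

# driver='core' with backing_store=False keeps this demo file in memory only.
with h5py.File('vlen_demo.h5', 'w', driver='core', backing_store=False) as f:
    ds = f.create_dataset('phn_mfccs', (len(phn_mfccs),), dtype=dt)
    for i, arr in enumerate(phn_mfccs):
        ds[i] = arr.ravel()  # each vlen element must be 1d
    # Restore the known trailing dimensions (13, 1) when reading back.
    restored = [ds[i].reshape(-1, 13, 1) for i in range(len(phn_mfccs))]

shapes_match = all(r.shape == a.shape for r, a in zip(restored, phn_mfccs))
values_match = all(np.array_equal(r, a) for r, a in zip(restored, phn_mfccs))
print(shapes_match, values_match)
```

This only works because the trailing dimensions are the same for every element. If they varied too, the shapes would have to be stored separately, for example in a second dataset.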


Pytorch memory model: how does "torch.from_numpy()" work? [duplicate]

I'm trying to have an in-depth understanding of how torch.from_numpy() works.
import numpy as np
import torch
arr = np.zeros((3, 3), dtype=np.float32)
t = torch.from_numpy(arr)
print("arr: {0}\nt: {1}\n".format(arr, t))
arr[0,0]=1
print("arr: {0}\nt: {1}\n".format(arr, t))
print("id(arr): {0}\nid(t): {1}".format(id(arr), id(t)))
The output looks like this:
arr: [[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
t: tensor([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
arr: [[1. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
t: tensor([[1., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
id(arr): 2360964353040
id(t): 2360964352984
This is part of the doc from torch.from_numpy():
from_numpy(ndarray) -> Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to
the tensor will be reflected in the ndarray and vice versa. The returned
tensor is not resizable.
And this is taken from the doc of id():
Return the identity of an object.
This is guaranteed to be unique among simultaneously existing objects.
(CPython uses the object's memory address.)
So here comes the question:
Since the ndarray arr and tensor t share the same memory, why do they have different memory addresses?
Any ideas/suggestions?
Yes, t and arr are different Python objects at different regions of memory (hence the different id), but both point to the same memory address, which holds the data (a C array, usually contiguous).
numpy operates on this region using C code bound to Python functions; the same goes for torch (but using C++). id() doesn't know anything about the memory address of the data itself, only about its "wrappers".
EDIT: When you assign b = a (assuming a is np.array), both are references to the same Python wrapper (np.ndarray). In other words they are the same object with different name.
It's just how Python's assignment works, see documentation. All of the cases below would return True as well:
import torch
import numpy as np
tensor = torch.tensor([1,2,3])
tensor2 = tensor
id(tensor) == id(tensor2)
arr = np.array([1, 2, 3, 4, 5])
arr2 = arr
id(arr) == id(arr2)
some_str = "abba"
other_str = some_str
id(some_str) == id(other_str)
value = 0
value2 = value
id(value) == id(value2)
Now, when you use torch.from_numpy on an np.ndarray you have two objects of different classes (the new torch.Tensor and the original np.ndarray). They are two distinct Python objects, so they can't have the same id while both exist. One could see this case as analogous to the one below:
value = 3
string_value = str(3)
id(value) == id(string_value)
Here it's intuitive both string_value and value are two different objects at different memory locations.
EDIT 2:
All in all, the concepts of a Python object and its underlying C array have to be kept separate. id() doesn't know about C bindings (how could it?), but it knows about the memory addresses of Python structures (torch.Tensor, np.ndarray).
In the case of numpy and torch.tensor you can have the following situations:
separate on the Python level but using the same memory region for the array (torch.from_numpy)
separate on the Python level and in the underlying memory region (one torch.tensor and another np.array). This could be created by from_numpy followed by clone() or a similar deep-copy operation.
same on the Python level and in the underlying memory region (e.g. two names for one torch.tensor object, one referencing the other, as shown above)
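The three situations can be checked by comparing the address of the data buffer rather than id(). This is a small sketch; data_address is a helper name made up for illustration, built on Tensor.data_ptr() and ndarray.ctypes.data.

```python
import numpy as np
import torch

arr = np.zeros((3, 3), dtype=np.float32)

shared = torch.from_numpy(arr)  # separate Python object, same data buffer
copied = shared.clone()         # separate Python object, separate data buffer
alias = shared                  # same Python object under a second name

def data_address(x):
    # Hypothetical helper: address of the first element of the underlying buffer.
    return x.data_ptr() if isinstance(x, torch.Tensor) else x.ctypes.data

print(id(shared) == id(arr))                      # False: different wrappers
print(data_address(shared) == data_address(arr))  # True: shared buffer
print(data_address(copied) == data_address(arr))  # False: clone() copied the data
print(alias is shared)                            # True: one object, two names
```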

Tiling a tensor so I get consecutive multiples in TensorFlow

I would like a method that could turn [1,2,3] into [1,1,2,2,3,3].
My thought is something like
val = tf.constant([1.,2.,3.]) #1,2,3
tiled = tf.tile(val, [2]) # [1,2,3,1,2,3]
reshaped = tf.reshape(tiled, (2,3)) # [[1,2,3], [1,2,3]]
transposed = tf.transpose(reshaped) # [[1,1], [2,2], [3,3]]
flattened = tf.reshape(transposed, (6,)) # [1,1,2,2,3,3]
I haven't tested the above, but it looks like it should work. But, is there a cleaner way to do it? Mine seems ugly.
The motivation is to make some sort of GMM, where I can get a 20-dim vector that is the concatenation of two 10-dim normal distributions, each multiplied by a different random variable. So if there's a different approach for that, I'm interested as well. Thanks in advance.
One alternative without tf.transpose: add a second dimension to your tensor, tile along the second axis, and then flatten it:
t = tf.expand_dims(val, 1)
t = tf.tile(t, (1, 2))
t = tf.reshape(t, (-1,))
t.eval()
# array([ 1., 1., 2., 2., 3., 3.], dtype=float32)
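To see why the intermediate shapes work out, here is the same sequence of steps mirrored in NumPy (a stand-in chosen so the shapes can be inspected without a TensorFlow session; the TensorFlow calls above follow the same shape logic).

```python
import numpy as np

val = np.array([1., 2., 3.])

col = val[:, np.newaxis]      # shape (3, 1), like tf.expand_dims(val, 1)
tiled = np.tile(col, (1, 2))  # shape (3, 2): each value repeated along axis 1
flat = tiled.reshape(-1)      # row-major flatten keeps the repeats adjacent

print(col.shape, tiled.shape, flat)
```

NumPy's np.repeat(val, 2) gives the same result in one call. Newer TensorFlow releases have a tf.repeat as well, so it is worth checking whether your version provides it.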

scipy.linalg.block_diag vs scipy.sparse.block_diag in terms of efficiency

I have a question about the way scipy builds block diagonal matrices. I was expecting that creating a sparse block diagonal matrix would be much quicker and more efficient than creating a dense one (because of sparsity compression). But it turns out that's not the case (or maybe I am using some inefficient method):
from timeit import default_timer as timer
import numpy as np
from scipy.sparse import block_diag as bd_sp
from scipy.linalg import block_diag as bd_la
m = [np.identity(1)] * 10000
before = timer()
res = bd_sp(m)
timer()-before
#takes 33.79 secs
before = timer()
res = bd_la(*m)
timer()-before
#takes 0.069 secs
What am I missing? Thanks in advance for your replies.
In [625]: [np.identity(1)*i for i in range(1,5)]
Out[625]: [array([[1.]]), array([[2.]]), array([[3.]]), array([[4.]])]
In [626]: sparse.block_diag(_)
Out[626]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [627]: _.A
Out[627]:
array([[1., 0., 0., 0.],
[0., 2., 0., 0.],
[0., 0., 3., 0.],
[0., 0., 0., 4.]])
block_diag uses bmat to join the elements. bmat makes coo matrices from all elements, and combines their attributes with offsets, and makes a new coo matrix. The code is readable Python.
It may be more efficient to construct your own data, row, col arrays. block_diag is a convenience, and fine for combining a few large matrices, but not efficient when combining many small ones.
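As a rough sketch of what "construct your own data, row, col arrays" could look like: the function below (block_diag_coo is a name made up here, not a scipy function) computes the offsets for each block and builds a single coo matrix at the end. It assumes a non-empty list of small dense blocks, and it stores every entry of each block, including zeros inside a block.

```python
import numpy as np
from scipy import sparse

def block_diag_coo(blocks):
    # Hypothetical helper: collect COO triplets for all blocks, then build once.
    rows, cols, data = [], [], []
    r_off = c_off = 0
    for b in blocks:
        b = np.atleast_2d(b)
        nr, nc = b.shape
        r, c = np.indices((nr, nc))       # local row/col index of every entry
        rows.append(r.ravel() + r_off)    # shift to the block's position
        cols.append(c.ravel() + c_off)
        data.append(b.ravel())
        r_off += nr
        c_off += nc
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(r_off, c_off),
    )

blocks = [np.identity(1) * i for i in range(1, 5)]
res = block_diag_coo(blocks)
print(res.toarray())
```

For the special case in the question (many 1x1 blocks) the result is just a diagonal matrix, so sparse.diags or sparse.identity would be simpler still.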
The linalg function is also Python (and pretty short). It creates an out array of the right shape, and inserts the blocks with sliced indexing. That's an efficient dense array solution. Most of the hard work is done in compiled numpy code.
Sparse matrices can be faster when doing matrix multiplication (and related linalg solvers). For most other operations, including initialization, they are slower than equivalent dense code. They are also valuable when the problem is too big to hold as a dense array.

Unexpected difference of spsolve and solve

I need to solve linear systems of varying sizes. Sometimes the size may be 0 or 1, in which case errors occur. For example,
import numpy as np
from numpy.linalg import solve
from scipy.sparse.linalg import spsolve
A1 = np.array([[1,2],[2,1]])
b1 = np.array([[1],[1]])
A2 = np.array([[1]])
b2 = np.array([[1]])
Some unexpected results will happen when calling spsolve or solve:
sage: solve(A1,b1)
array([[ 0.33333333],
[ 0.33333333]])
sage: solve(A2,b2)
array([[ 1.]])
sage: spsolve(A1,b1)
array([ 0.33333333, 0.33333333])
sage: spsolve(A2,b2)
ValueError: object of too small depth for desired array
Notice that the call spsolve(A1,b1) actually yields a row vector; is there any way to force it to be a column vector? Also, the error in calling spsolve(A2,b2) is very strange, since the sizes of A2 and b2 are not zero.
spsolve does not return a 2d array but a 1d vector.
Use numpy.atleast_2d to inflate the vector and .T to turn it into a column (2d) vector, e.g., in your example
In [10]: np.atleast_2d(spsolve(A1,b1)).T
Out[10]:
array([[ 0.33333333],
       [ 0.33333333]])
This probably also solves your second issue, related to the depth of the result vector.
(I don't use sage, so I can't reproduce your error.)
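A small wrapper along these lines might make the behaviour uniform across sizes. This is only a sketch: solve_as_column is a made-up name, and falling back to the dense numpy.linalg.solve for the 1x1 case is a choice made here to sidestep the reported error, not something verified against the sage setup. Size 0 is not handled.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

def solve_as_column(A, b):
    # Hypothetical helper: always return the solution as an (n, 1) column.
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1)  # pass a plain 1d right-hand side
    if A.shape[0] == 1:
        x = np.linalg.solve(A, b)               # tiny system: dense solve
    else:
        x = spsolve(csc_matrix(A), b)           # sparse solve returns a 1d vector
    return np.asarray(x).reshape(-1, 1)

A1 = np.array([[1, 2], [2, 1]])
b1 = np.array([[1], [1]])
A2 = np.array([[1]])
b2 = np.array([[1]])

x1 = solve_as_column(A1, b1)
x2 = solve_as_column(A2, b2)
print(x1.shape, x2.shape)
```

reshape(-1, 1) does the same job as np.atleast_2d(...).T for a 1d result.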

How to construct a matrix based on an array in numpy?

I am trying to apply a function repeatedly over an array and build a matrix from the values it returns. If this were native Python, what I would do is:
[func(x, y) for y in xrange(Y)]
but if I do that, I need to wrap it with numpy.matrix() to vectorize it. What is the numpy way of doing this? Right now I am initializing a zeros matrix and then populating it with elements I get from a for loop, but that seems inefficient.
Take a look at the numpy tutorial, especially the part about Universal Functions or ufuncs. A ufunc is:
Functions that operate element by element on whole arrays.
which sounds like what you're asking for. Keep in mind that you probably don't need to write your own ufunc, but just write func in terms of existing ufuncs. For example:
import numpy as np
def hypot(a, b):
    return np.sqrt(a**2 + b**2)
>>> a = np.array([3., 5., 10.])
>>> b = np.array([4., 12., 24.,])
>>> hypot(a, b)
array([ 5., 13., 26.])
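To get a full matrix rather than a vector, a function written in terms of ufuncs can be combined with broadcasting. In the sketch below, func, xs and ys are illustrative stand-ins for the asker's function and inputs.

```python
import numpy as np

def func(x, y):
    # Stand-in for the asker's func, written with ufuncs only.
    return np.sqrt(x**2 + y**2)

xs = np.array([3., 5., 8.])
ys = np.array([4., 12.])

# xs[:, np.newaxis] has shape (3, 1) and ys[np.newaxis, :] has shape (1, 2);
# broadcasting evaluates func for every (x, y) pair, giving shape (3, 2).
matrix = func(xs[:, np.newaxis], ys[np.newaxis, :])
print(matrix.shape)
```

This replaces both the nested list comprehension and the preallocate-then-fill loop, as long as func only uses operations that work elementwise on arrays.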