Here's an MWE that illustrates the issue I have:
import numpy as np
arr = np.full((3, 3), -1, dtype="i,i")
doesnt_work = arr == (-1, -1)
n_arr = np.full((3, 3), -1, dtype=int)
works = n_arr == 10
arr is supposed to be an array of tuples, but it doesn't behave as expected.
works is an array of booleans, as expected, but doesnt_work is False. Is there a way to get numpy to do elementwise comparisons on more complex types, or do I have to resort to list comprehension, flatten and reshape?
There's a second problem:
f = arr[(0, 0)] == (-1, -1)
f is False, because arr[(0,0)] is of type numpy.void rather than a tuple. So even if the componentwise comparison worked, it would give the wrong result. Is there a clever numpy way to do this or should I just resort to list comprehension?
Both problems are actually the same problem! And are both related to the custom data type you created when you specified dtype="i,i".
If you run arr.dtype you will get dtype([('f0', '<i4'), ('f1', '<i4')]). That is a 2 signed integers that are placed in one continuous block of memory. This is not a python tuple. Thus it is clear why the naive comparison fails, since (-1,-1) is a python tuple and is not represented in memory the same way that the numpy data type is.
However if you compare with a_comp = np.array((-1,-1), dtype="i,i") you get the exact behavior you are expecting!
You can read more about how the custom dtype stuff works on the numpy docs:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
Oh and to address what np.void is: it comes from the idea that it is a void c pointer which essentially means that it is an address to a continuous block of memory of unspecified type. But, provided you (the programer) knows what is going to be stored in that memory (in this case two back to back integers) it's fine provided you are careful (compare with the same custom data type).
I recently learned of the hdf5 compression and working with it. That it has some advantages over .npz/npy when working with gigantic files.
I managed to try out a small list, since I do sometimes work with lists that have strings as follows;
def write():
test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
with h5py.File('example_file.h5', 'w') as f:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
f.close()
However I got this error:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching for hours over the net on any better ways to do this, I couldn't get.
Is there a better way to compress lists with H5?
This is a more general answer for Nested Lists where each nested list is a different length. It also works for the simpler case when the nested lists are equal length. There are 2 solutions: 1 with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset based on the longest substring and add elements to the "short" substrings.
You will get 'None' (or a substring) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. This shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'
import h5py
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
['a11','a12','a13','a14','a15','a16','a17'],
['a21','a22','a23','a24','a25','a26','a27','a28']]
# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))
with h5py.File('example_nested.h5', 'w') as f:
f.create_dataset('test3', data=test_array, compression='gzip')
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of adding 'None' values for "short" substrings. Also, you don't have to determine the array length in advance, as the number of rows is not defined when VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
['a11','a12','a13','a14','a15','a16','a17'],
['a21','a22','a23','a24','a25','a26','a27','a28']]
slen = max(len(item) for sublist in test_list for item in sublist)
with tb.File('example_nested_tb.h5', 'w') as h5f:
vlarray = h5f.create_vlarray('/','vla_test', tb.StringAtom(slen) )
for slist in test_list:
arr = np.array(slist,dtype='S'+str(slen))
vlarray.append(arr)
print('-->', vlarray.name)
for row in vlarray:
print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np
test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
'a1','a2','a1','a2','a1','a2', 'a1','a2',
'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
with h5py.File('example_file.h5', 'w') as f:
f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
I am having a problem in my code with non contiguous arrays.
In particular I get the following warning message:
C:\Program Files\Anaconda2\lib\site-packages\skimage\util\shape.py:247: RuntimeWarning: Cannot provide views on a non-contiguous input array without copying.
warn(RuntimeWarning("Cannot provide views on a non-contiguous input "
I am using np.column_stack
import numpy as np
x = np.array([1,2,3,4])
y = np.array([5,6,7,8])
stack = np.column_stack((x,y))
stack.flags.f_contiguous
Out[2]: False
but I get a non contiguous array
Do you know how can I get contigous array? should I use always ascontiguousarray after column_stack?
np.stack([x, y]) is not contiguous. However, np.stack([x, y]).T is.
np.stack([x, y]) # Transpose of what you want and not contiguous
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
Instead:
stack = np.stack([x, y]).T
In [276]: xy=np.column_stack((x,y))
In [277]: np.info(xy)
class: ndarray
shape: (4, 2)
strides: (8, 4)
itemsize: 4
aligned: True
contiguous: True
fortran: False
data pointer: 0xa836ec0
byteorder: little
byteswap: False
type: int32
The skimage code, https://github.com/scikit-image/scikit-image/blob/master/skimage/util/shape.py
# -- build rolling window view
if not arr_in.flags.contiguous:
warn(RuntimeWarning("Cannot provide views on a non-contiguous input "
"array without copying."))
arr_in = np.ascontiguousarray(arr_in)
That test, on the column_stack is ok:
In [278]: xy.flags.contiguous
Out[278]: True
In [279]: xy.T.flags.contiguous
Out[279]: False
Normally constructed 2d arrays are contiguous. But their transpose is F-contiguous. The warning is that np.ascontiguousarray will produce a copy. For very large arrays that could be a problem.
If this warning comes up often you could either suppress it, or routinely use ascontiguousarray before calling this function.
RuntimeWarning: Cannot provide views on a non-contiguous input array without copying
numpy slicing e.g. S=np.s_[1:-1]; V=A[1:-1], produces a view of the underlying array. I can find this underlying array by V.base. If I pass such a view to a function, e.g.
def f(x):
return x.base
then f(V) == A. But how can I find the slice information S? I am looking for an attribute something like base containing information on the slice that created this view. I would like to be able to write a function to which I can pass a view of an array and return another view of the same array calculated from the view. E.g. I would like to be able to shift the view to the right or left of a one dimensional array.
As far as I know the slicing information is not stored anywhere, but you might be able to deduce it from attributes of the view and base.
For example:
In [156]: x=np.arange(10)
In [157]: y=x[3:]
In [159]: y.base
Out[159]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [160]: y.data
Out[160]: <memory at 0xb1a16b8c>
In [161]: y.base.data
Out[161]: <memory at 0xb1a16bf4>
I like the __array_interface__ value better:
In [162]: y.__array_interface__['data']
Out[162]: (163056924, False)
In [163]: y.base.__array_interface__['data']
Out[163]: (163056912, False)
So y databuffer starts 12 bytes beyond x. And since y.itemsize is 4, this means that the slicing start is 3.
In [164]: y.shape
Out[164]: (7,)
In [165]: x.shape
Out[165]: (10,)
And comparing the shapes, I deduce that the slice stop is None (the end).
For 2d arrays, or stepped slicing you'd have to look at the strides as well.
But in practice it is probably easier, and safer, to pass the slicing object (tuple, slice, etc) to your function, rather than deduce it from the results.
In [173]: S=np.s_[1:-1]
In [174]: S
Out[174]: slice(1, -1, None)
In [175]: x[S]
Out[175]: array([1, 2, 3, 4, 5, 6, 7, 8])
That is, pass S itself, rather than deduce it. I've never seen it done before.
This question already has answers here:
NumPy array initialization (fill with identical values)
(9 answers)
Closed 9 years ago.
I know how to fill with zero in a 100-elements array:
np.zeros(100)
But what if I want to fill it with 9?
You can use:
a = np.empty(100)
a.fill(9)
or also if you prefer the slicing syntax:
a[...] = 9
np.empty is the same as np.ones, etc. but can be a bit faster since it doesn't initialize the data.
In newer numpy versions (1.8 or later), you also have:
np.full(100, 9)
If you just want same value throughout the array and you never want to change it, you could trick the stride by making it zero. By this way, you would just take memory for a single value. But you would get a virtual numpy array of any size and shape.
>>> import numpy as np
>>> from numpy.lib import stride_tricks
>>> arr = np.array([10])
>>> stride_tricks.as_strided(arr, (10, ), (0, ))
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
But, note that if you modify any one of the elements, all the values in the array would get modified.
This question has been discussed in some length earlier, see NumPy array initialization (fill with identical values) , also for which method is fastest.
As far as I can see there is no dedicated function to fill an array with 9s. So you should create an empty (ie uninitialized) array using np.empty(100) and fill it with 9s or whatever in a loop.