Cython: storing unicode in numpy array - numpy

I'm new to cython, and I've been having a recurring problem involving encoding unicode inside of a numpy array.
Here's an example of the problem:
import numpy as np
cimport numpy as np

cpdef pass_array(np.ndarray[ndim=1,dtype=np.unicode] a):
    pass

cpdef access_unicode_item(np.ndarray a):
    cdef unicode item = a[0]
Example errors:
In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [4]: pass_array(unicode_array)
ValueError: Does not understand character buffer dtype format string ('w')
In [5]: access_unicode_item(unicode_array)
TypeError: Expected unicode, got numpy.unicode_
The problem seems to be that the values are not real unicode, but instead numpy.unicode_ . Is there a way to encode the values in the array as proper unicode (so that I can type individual items for use in cython code)?

In Py2.7
In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode)
In [376]: arr
Out[376]:
array([u'array', u'of', u'unicode'],
      dtype='<U7')
In [377]: arr.dtype
Out[377]: dtype('<U7')
In [378]: type(arr[0])
Out[378]: numpy.unicode_
In [379]: type(arr[0].item())
Out[379]: unicode
In general, x[0] returns an element of x wrapped in a NumPy scalar type. In this case that type is np.unicode_, which is a subclass of unicode.
In [384]: isinstance(arr[0],np.unicode_)
Out[384]: True
In [385]: isinstance(arr[0],unicode)
Out[385]: True
I think you'd encounter the same sort of issues between np.int32 and int. But I haven't worked enough with cython to be sure.
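The conversion shown in the session above can be sketched in plain Python. This is a minimal sketch that assumes Python 3, where the NumPy scalar type is np.str_ rather than np.unicode_ and the built-in type is str rather than unicode; the variable names are only illustrative.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

elem = arr[0]            # NumPy scalar (np.str_), a subclass of str
plain = arr[0].item()    # built-in Python str
as_list = arr.tolist()   # list of built-in str objects

print(type(elem).__name__, isinstance(elem, str))
print(type(plain).__name__)
print([type(s).__name__ for s in as_list])
```

Calling .item() on an element (or .tolist() once, up front) before assigning to a variable typed as unicode in Cython should avoid the TypeError, at the cost of creating Python objects.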
Where have you seen cython code that specifies a string (unicode or byte) dtype?
http://docs.cython.org/src/tutorial/numpy.html has expressions like
# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
....
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):
The purpose of the [] part is to improve indexing efficiency.
What we need to do then is to type the contents of the ndarray objects. We do this with a special “buffer” syntax which must be told the datatype (first argument) and number of dimensions (“ndim” keyword-only argument, if not provided then one-dimensional is assumed).
I don't think np.unicode will help because it doesn't specify character length. The full string dtype has to include the number of characters, eg. <U7 in my example.
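The point about character length can be checked directly. The following is a small sketch, assuming Python 3 and a NumPy version where np.str_ is the generic Unicode scalar type; it compares the item size of a concrete string dtype with that of the generic one.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])
generic = np.dtype(np.str_)     # generic Unicode dtype, no length attached

# The array's dtype carries the length: 7 characters at 4 bytes each.
print(arr.dtype.kind, arr.dtype.itemsize)

# The generic dtype has no fixed size, so it cannot describe a buffer layout.
print(generic.kind, generic.itemsize)
```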
We need to find working examples which pass string arrays - either in the cython documentation or other SO cython questions.
For some operations, you could treat the unicode array as an array of int32.
In [397]: arr.nbytes
Out[397]: 84
That is 3 strings x 7 chars/string x 4 bytes/char = 84 bytes.
In [398]: arr.view(np.int32).reshape(-1,7)
Out[398]:
array([[ 97, 114, 114,  97, 121,   0,   0],
       [111, 102,   0,   0,   0,   0,   0],
       [117, 110, 105,  99, 111, 100, 101]])
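As a rough illustration of treating the strings as integers, the sketch below views the array as unsigned 32-bit code points and then rebuilds the strings. It assumes Python 3 and native byte order; np.uint32 is used here instead of np.int32, which is a choice of this sketch rather than something the session above requires.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])   # 7 code points per item
width = arr.dtype.itemsize // 4              # 4 bytes per code point

# Reinterpret the same memory as integers, one row per string.
codes = arr.view(np.uint32).reshape(len(arr), width)
print(codes)

# Rebuild the strings, dropping the zero padding.
decoded = ["".join(chr(c) for c in row if c) for row in codes]
print(decoded)
```

An integer array like codes is the kind of buffer that a typed Cython argument can accept, so character-level work could be done on it without touching Python string objects.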
Cython gives you the greatest speed improvement when you can bypass Python functions and methods. That would include bypassing much of the Python string and unicode functionality.

Related

how to compress lists/nested lists in hdf5

I recently learned about hdf5 compression and started working with it. It has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists of strings, as follows:
def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
        f.close()
However I got this error:
    f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with H5?
This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case when the nested lists are equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset based on the longest sublist and add elements to the "short" sublists.
You will get 'None' (or a truncated piece of it, when the string length is shorter than 4) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. This shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'.
import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))

with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
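The padding step in the h5py example can be tried with NumPy alone, without writing a file. The sketch below is a variation, not the method above: it pads with empty strings instead of None (an assumption of this sketch), which makes the padding easy to strip again. The shortened test_list and the name restored are illustrative.

```python
import numpy as np

test_list = [['a01', 'a02', 'a03'],
             ['a11', 'a12', 'a13', 'a14'],
             ['a21', 'a22']]

slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

# Pad each short sublist with '' so every row has arrlen entries.
padded = np.array([tl + [''] * (arrlen - len(tl)) for tl in test_list],
                  dtype='S' + str(slen))
print(padded.shape, padded.dtype)

# Recover the ragged lists by dropping the empty padding entries.
restored = [[s.decode() for s in row if s] for row in padded]
print(restored)
```

This only round-trips cleanly when no real entry is an empty string, since real empty entries would be indistinguishable from padding.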
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of adding 'None' values for "short" sublists. Also, you don't have to determine the array length in advance, as the number of rows is not defined when the VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/','vla_test', tb.StringAtom(slen) )
    for slist in test_list:
        arr = np.array(slist,dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
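The dtype difference described above can be inspected without h5py. This is a minimal sketch, assuming ASCII-only strings (astype('S') would raise an error on characters that cannot be encoded as ASCII); the names as_unicode and as_bytes are illustrative.

```python
import numpy as np

test_list = ['a1', 'a2', 'a1', 'a2']

as_unicode = np.array(test_list)     # NumPy picks a Unicode dtype by default
as_bytes = as_unicode.astype('S')    # fixed-width byte strings, length inferred

print(as_unicode.dtype.kind, as_unicode.dtype.itemsize)
print(as_bytes.dtype.kind, as_bytes.dtype.itemsize)
```

The byte-string array uses 2 bytes per item instead of 8, and it is the kind of array the answer passes to create_dataset.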
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
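To get a feel for how much the level matters on data like this, the sketch below compresses the raw bytes of a similar array with Python's zlib module at levels 1 and 9. This is only a stand-in: it assumes zlib's deflate behaves comparably to the HDF5 gzip filter, it ignores HDF5 chunking, and it does not measure I/O time. The array size is arbitrary.

```python
import zlib
import numpy as np

data = np.array(['a1', 'a2'] * 5000, dtype='S2')
raw = data.tobytes()

# Compressed size at a low and a high compression level.
sizes = {level: len(zlib.compress(raw, level)) for level in (1, 9)}
print(len(raw), sizes)

# Both levels are lossless: decompressing gives back the original bytes.
roundtrip = zlib.decompress(zlib.compress(raw, 9))
print(roundtrip == raw)
```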
import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)

which float precision are numpy arrays by default?

I wonder which float format numpy arrays use by default.
(Or do the values even get converted when declaring an np.array? If so, what about Python lists?)
E.g. float16, float32 or float64?
float64. You can check it like
>>> np.array([1, 2]).dtype
dtype('int64')
>>> np.array([1., 2]).dtype
dtype('float64')
If you don't specify the data type when you create the array, numpy will infer the type. From the docs:
dtype : data-type, optional - The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence.
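A short sketch of the inference and of overriding it follows. It also touches the question about Python lists: a Python float is itself a 64-bit double on common platforms, so converting to float64 and back loses nothing. Variable names are illustrative.

```python
import sys
import numpy as np

a = np.array([1., 2])                      # a float is present -> float64
b = np.array([1., 2], dtype=np.float32)    # explicit request -> float32
c = np.array([1.5, 2.5]).tolist()          # back to built-in Python floats

print(a.dtype, b.dtype, type(c[0]).__name__)
print(sys.float_info.mant_dig)             # mantissa bits of a Python float
```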

How can I combine multiple numpy arrays into a single memoryview for cython?

I have a list of varying size that contains numpy arrays with the same data type and shape. I would like to process this data using a function written in Cython without copying the data. Both memoryviews and the Python buffer protocol seem to support this kind of data using indirect for the first dimension. So I was hoping that something like this could work:
%%cython
from cython.view cimport indirect

def test(list a):
    cdef double[::indirect, :] x
    x = a
    x[0, 0] = 42
Unfortunately it doesn't.
Is there a way to convert this list of numpy arrays into such a memoryview?

numpy argmax not getting all values from generator expression

The output of the following
import numpy as np
print(np.argmax([i for i in range(0, 10)]))
print(np.argmax(i for i in range(0, 10)))
is
9
0
Why does argmax reduce the generator expression only once?
Compare these two expressions:
In [682]: np.asarray([i for i in range(3)])
Out[682]: array([0, 1, 2])
In [683]: np.asarray(i for i in range(3))
Out[683]: array(<generator object <genexpr> at 0xb367bb9c>, dtype=object)
asarray (or array) applied to a list produces an array with numbers. The same thing applied to the generator produces a dtype=object array with 1 item, the generator itself. In fact its shape is () (0d). You can recover this generator with np.array(i for i in range(3))[()]
fromiter can iterate a generator, but array only iterates on things like lists and tuples.
In [688]: np.fromiter((i for i in range(3)),int)
Out[688]: array([0, 1, 2])
And argmax depends on its input being an array.
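Putting the pieces together, a minimal sketch of two ways to hand a generator's values to argmax: consume it with np.fromiter, or turn it into a list first. The names wrapped, values and values_from_list are illustrative.

```python
import numpy as np

# asarray does not iterate the generator; it wraps it in a 0-d object array.
wrapped = np.asarray(i for i in range(10))
print(wrapped.shape, wrapped.dtype)

# fromiter consumes the generator and builds a proper 1-d integer array.
values = np.fromiter((i for i in range(10)), dtype=int)
print(np.argmax(values))

# Building a list first works as well.
values_from_list = np.array(list(i for i in range(10)))
print(np.argmax(values_from_list))
```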
As I don't have the required reputation to comment, I am adding this as an answer.
As suggested by @hpaulj, np.argmax calls the asarray function in numeric.py. The developers document the accepted inputs in the code:
def asarray(a, dtype=None, order=None):
    """Convert the input to an array.

    Parameters
    ----------
    a : array_like
        Input data, in any form that can be converted to an array. This
        includes lists, lists of tuples, tuples, tuples of tuples, tuples
        of lists and ndarrays.
    ...
Hence your argument a doesn't match the requirements. This also explains why zero is returned for any generator input. The return value comes from the call
result = getattr(asarray(obj), method)(*args, **kwds)
in fromnumeric.py, which is reached first. For a generator, asarray(obj) is a 0-d object array holding the generator itself, so argmax sees a single element and returns index 0.
This is what my research on the question turned up.

Problems while learning cython

I am learning cython to speed up numpy. I wrote some code to see how to optimize numpy array calculations.
The python code is:
from numpy import *

def set_onsite(n):
    a=linspace(0,n,n+1)
    onsite=zeros([n+1,n+1],float)
    for i in range(0,n+1):
        onsite[i,i]=a[i]*a[i]
    return onsite
Then, I tried to cythonize this code:
import numpy as np
cimport numpy as np
cimport cython
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def set_onsite(np.int_t n):
    cdef np.ndarray[double,ndim=1,mode='c'] a=np.linspace(0,n,n+1)
    cdef np.ndarray[double,ndim=2,mode='c'] onsite=np.empty(n+1,n+1)
    cdef np.int_t i
    for i in range(0,n+1):
        onsite[i,i]=a[i]*a[i]
    return onsite
After running the setup.py file, I got the .so file. I ran %timeit myfile.set_onsite(10000), but IPython showed
TypeError: data type not understood
So could anyone tell me what is going on here?
I checked my code many times but I did not figure out where the problem arises.
The problem has nothing to do with cython; it's just that np.empty expects the first argument to be the shape given as an int or tuple of ints. The second argument is interpreted as the dtype:
In [19]: np.empty(5,5)
TypeError: data type not understood
while np.empty((5,5)) returns an empty array of shape (5,5).
So instead use
cdef np.ndarray[double,ndim=2,mode='c'] onsite=np.empty((n+1,n+1))
Note the double set of parentheses around n+1, n+1. Or, use np.zeros instead of np.empty to make the Cython function match the Python function.
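The shape-versus-dtype mix-up can be reproduced in plain Python, without compiling anything. The sketch below is a pure NumPy version of set_onsite that uses np.zeros with a tuple shape; it also swaps the explicit loop for a vectorized diagonal assignment, which is a choice of this sketch and not part of the original code.

```python
import numpy as np

def set_onsite(n):
    a = np.linspace(0, n, n + 1)
    onsite = np.zeros((n + 1, n + 1))    # shape passed as a single tuple
    idx = np.arange(n + 1)
    onsite[idx, idx] = a * a             # fill the diagonal in one step
    return onsite

# Passing two positional ints makes NumPy read the second one as a dtype.
try:
    np.empty(3, 3)
    raised = False
except TypeError:
    raised = True

result = set_onsite(3)
print(raised)
print(np.diag(result))
```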
PS: When debugging Python, it is helpful to note not only the error message, but the line that raises the exception:
  File "comp.pyx", line 13, in comp.set_onsite (comp.c:1290)
    cdef np.ndarray[double,ndim=2,mode='c'] onsite=np.empty(n+1,n+1)
TypeError: data type not understood