numpy argmax not getting all values from generator expression - numpy

The output of the following
import numpy as np
print(np.argmax([i for i in range(0, 10)]))
print(np.argmax(i for i in range(0, 10)))
is
9
0
Why does argmax reduce the generator expression only once?

Compare these two expressions:
In [682]: np.asarray([i for i in range(3)])
Out[682]: array([0, 1, 2])
In [683]: np.asarray(i for i in range(3))
Out[683]: array(<generator object <genexpr> at 0xb367bb9c>, dtype=object)
asarray (or array) applied to a list produces an array with numbers. The same thing applied to the generator produces a dtype=object array with 1 item, the generator itself. In fact its shape is () (0d). You can recover this generator with np.array(i for i in range(3))[()]
fromiter can iterate a generator, but array only iterates on things like lists and tuples.
In [688]: np.fromiter((i for i in range(3)),int)
Out[688]: array([0, 1, 2])
And argmax depends on its input being an array.
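A minimal sketch of the difference, assuming only NumPy is available; the variable names are illustrative and not taken from the question:

```python
import numpy as np

# A generator is not array_like: asarray wraps it whole in a 0-d object array.
wrapped = np.asarray(i for i in range(10))
print(wrapped.shape, wrapped.dtype)

# fromiter consumes the generator element by element, so argmax sees 10 values.
values = np.fromiter((i for i in range(10)), dtype=int)
idx = np.argmax(values)
print(idx)

# Materializing the generator into a list first works as well.
idx_from_list = np.argmax(list(i for i in range(10)))
print(idx_from_list)
```

Either route gives argmax a real 1-d array to reduce, instead of a single wrapped object.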

As I don't have the required reputation to comment, I am adding this as an answer.
As suggested by @hpaulj, np.argmax calls the asarray function in numeric.py. The developers mention this in its docstring:
def asarray(a, dtype=None, order=None):
    """Convert the input to an array.
    Parameters
    ----------
    a : array_like
        Input data, in any form that can be converted to an array. This
        includes lists, lists of tuples, tuples, tuples of tuples, tuples
        of lists and ndarrays.
    ...
Hence your argument a doesn't match the requirements. As for why zero is returned for any input: the return value comes from the call
result = getattr(asarray(obj), method)(*args, **kwds)
in fromnumeric.py, which is called first. asarray wraps the generator in a 0-d object array holding a single element, so argmax over that one element returns index 0.
This was my research regarding the question.

How to compare numpy arrays of tuples?

Here's an MWE that illustrates the issue I have:
import numpy as np
arr = np.full((3, 3), -1, dtype="i,i")
doesnt_work = arr == (-1, -1)
n_arr = np.full((3, 3), -1, dtype=int)
works = n_arr == 10
arr is supposed to be an array of tuples, but it doesn't behave as expected.
works is an array of booleans, as expected, but doesnt_work is False. Is there a way to get numpy to do elementwise comparisons on more complex types, or do I have to resort to list comprehension, flatten and reshape?
There's a second problem:
f = arr[(0, 0)] == (-1, -1)
f is False, because arr[(0,0)] is of type numpy.void rather than a tuple. So even if the componentwise comparison worked, it would give the wrong result. Is there a clever numpy way to do this or should I just resort to list comprehension?
Both problems are actually the same problem, and both are related to the custom data type you created when you specified dtype="i,i".
If you run arr.dtype you will get dtype([('f0', '<i4'), ('f1', '<i4')]). That is, two signed 32-bit integers placed in one contiguous block of memory. This is not a Python tuple. Thus it is clear why the naive comparison fails: (-1,-1) is a Python tuple and is not represented in memory the same way as the NumPy data type.
However, if you compare with a_comp = np.array((-1,-1), dtype="i,i"), you get exactly the behavior you are expecting!
You can read more about how the custom dtype stuff works on the numpy docs:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
Oh, and to address what np.void is: the name comes from the idea of a void C pointer, which is essentially the address of a contiguous block of memory of unspecified type. Provided you (the programmer) know what is stored in that memory (in this case two back-to-back integers), it's fine as long as you are careful (compare with the same custom data type).
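A short sketch of the comparison described above, assuming a reasonably recent NumPy. The array is filled field by field here (f0 and f1 are NumPy's default field names), and the names target, mask and first are illustrative:

```python
import numpy as np

# Build the structured array and fill both fields with -1.
arr = np.zeros((3, 3), dtype="i,i")
arr["f0"] = -1
arr["f1"] = -1

# Comparing against a value of the same structured dtype is elementwise.
target = np.array((-1, -1), dtype=arr.dtype)
mask = arr == target

# Comparing each field separately gives the same mask.
mask_by_field = (arr["f0"] == -1) & (arr["f1"] == -1)

# A single element (numpy.void) converts to a plain tuple with .item().
first = arr[0, 0].item()
print(mask.shape, first)
```

Calling .item() on one element is one way to get back a plain Python tuple when an ordinary tuple comparison is really what is wanted.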

how to compress lists/nested lists in hdf5

I recently learned about HDF5 compression and how to work with it. It has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists of strings, as follows:
def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
    f.close()
However I got this error:
f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with H5?
This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case when the nested lists are of equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized to the longest sublist and pad the "short" sublists with extra elements.
You will get 'None' (or a substring) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. This shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'
import h5py
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
['a11','a12','a13','a14','a15','a16','a17'],
['a21','a22','a23','a24','a25','a26','a27','a28']]
# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))
with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
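As a possible variation on the padding step, here is a NumPy-only sketch that pads with empty strings instead of None, so the filler is easy to strip when reading back. The shortened test_list and the names padded and restored are illustrative assumptions, not part of the original answer:

```python
import numpy as np

test_list = [['a01', 'a02'],
             ['a11', 'a12', 'a13'],
             ['a21']]

slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

# Pad with '' instead of None so the filler is an empty byte string.
padded = np.array([tl + [''] * (arrlen - len(tl)) for tl in test_list],
                  dtype='S' + str(slen))

# Recover the ragged lists by dropping the empty filler entries.
restored = [[s.decode() for s in row if s] for row in padded]
print(padded.shape, restored)
```

This only works if an empty string is never a legitimate value in the data.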
PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of adding 'None' values for "short" sublists. Also, you don't have to determine the array length in advance, as the number of rows is not defined when the VLArray is created (rows are added after creation). Again, take care with the dtype= entry. This uses the same method as above.
import tables as tb
import numpy as np
test_list = [['a01','a02','a03','a04','a05','a06'],
['a11','a12','a13','a14','a15','a16','a17'],
['a21','a22','a23','a24','a25','a26','a27','a28']]
slen = max(len(item) for sublist in test_list for item in sublist)
with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/','vla_test', tb.StringAtom(slen) )
    for slist in test_list:
        arr = np.array(slist,dtype='S'+str(slen))
        vlarray.append(arr)
    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np
test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
'a1','a2','a1','a2','a1','a2', 'a1','a2',
'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')
data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
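The dtype difference described above can be checked without h5py. This is a small NumPy-only sketch; the names as_unicode, as_bytes and converted are illustrative:

```python
import numpy as np

strings = ['a1', 'a2', 'a1', 'a2']

as_unicode = np.array(strings)            # default: fixed-width Unicode
as_bytes = np.array(strings, dtype='S')   # 'S' lets NumPy pick the width
converted = as_unicode.astype('S2')       # convert an existing Unicode array
print(as_unicode.dtype, as_bytes.dtype, converted.dtype)
```

Printing the dtype before calling create_dataset is a quick way to confirm that the array holds byte strings rather than Unicode.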

Writing SKLearn Regression Coefficients To Pandas Series

I have a regression model that I fit with sklearn's LinearRegression.
To extract the coefficients, I used the code:
coefficients = model.coef_
It produced the following array with a shape of (1, 10):
[-4.72307152e-05 1.29731143e-04 8.75483702e-05 -6.28749019e-04
1.75096740e-04 -3.30209379e-06 1.35937650e-03 3.89048429e-11
8.48406857e-03 -1.36499030e-05]
Now, I would like to save the array to a pd.Series. I am taking the following approach:
features = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"]
model_coefs = pd.Series(coefficients, index=features)
And, the system gives me the following error:
ValueError: Length of passed values is 1, index implies 10.
What I have tried:
Transposing the underlying array, coefficients, to give it a length of 10.
Reshaping the array to give it a shape of (10,1).
But nothing seems to work. I am not sure where I am going wrong.
For your case you want to flatten the array, so .ravel should do the trick. For example:
pd.Series(np.zeros((1, 10)).ravel(), index=features)
It's strange that the coeffs output has shape (1, 10); coef_ is 2-d, with shape (n_targets, n_features), only when the y passed to fit is 2-d. When I run the base sklearn example here (with multiple features), my coeffs are 1-d:
In [27]: regr.coef_
Out[27]:
array([ 3.03499549e-01, -2.37639315e+02, 5.10530605e+02, 3.27736980e+02,
-8.14131709e+02, 4.92814588e+02, 1.02848452e+02, 1.84606489e+02,
7.43519617e+02, 7.60951722e+01])
In [28]: regr.coef_.shape
Out[28]: (10,)
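A sketch of the flattening step with the question's feature names, assuming pandas and NumPy are installed. The coefficients array is a placeholder standing in for model.coef_, not real model output:

```python
import numpy as np
import pandas as pd

features = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"]

# Stand-in for model.coef_ with shape (1, 10); the values are placeholders.
coefficients = np.arange(10, dtype=float).reshape(1, 10)

# ravel() flattens to shape (10,), matching the 10-entry index.
model_coefs = pd.Series(coefficients.ravel(), index=features)

# Selecting the single row gives the same 1-d data.
model_coefs_row = pd.Series(coefficients[0], index=features)
print(model_coefs.shape)
```

Transposing or reshaping to (10, 1) does not help because the result is still 2-d; the Series constructor needs 1-d data of length 10.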

Numpy Array Shape Issue

I have initialized this empty 2d np.array
inputs = np.empty((300, 2), int)
And I am attempting to append a 2d row to it as such
inputs = np.append(inputs, np.array([1,2]), axis=0)
But I'm getting
ValueError: all the input arrays must have same number of dimensions
And Numpy thinks it's a 2 row 0 dimensional object (transpose of 2d)
np.array([1, 2]).shape
(2,)
Where have I gone wrong?
To add a row to a (300,2) shape array, you need a (1,2) shape array. Note the matching 2nd dimension.
np.array([[1,2]]) works. So does np.array([1,2])[None, :] and np.atleast_2d([1,2]).
I encourage the use of np.concatenate. It forces you to think more carefully about the dimensions.
Do you really want to start with np.empty? Look at its values. They are random, and probably large.
@Divakar suggests np.row_stack. That puzzled me a bit, until I checked and found that it is just another name for np.vstack. That function passes all inputs through np.atleast_2d before doing np.concatenate. So ultimately it is the same solution: turn the (2,) array into a (1,2) array.
A 2-d row needs double brackets in the array literal, so
np.array([1,2])
needs to be
np.array([[1,2]])
If you intend to append that as the last row into inputs, you can just simply use np.row_stack -
np.row_stack((inputs,np.array([1,2])))
Please note this np.array([1,2]) is a 1D array.
You can even pass it a 2D row version for the same result -
np.row_stack((inputs,np.array([[1,2]])))
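The options from both answers can be compared side by side. This sketch uses a small zero-filled stand-in for the (300, 2) array, and np.vstack in place of np.row_stack, since np.row_stack is a deprecated alias in NumPy 2.0; the out_* names are illustrative:

```python
import numpy as np

inputs = np.zeros((3, 2), dtype=int)   # small stand-in for the (300, 2) array
row = np.array([1, 2])                 # shape (2,)

# concatenate needs matching dimensions, so make the row (1, 2) first.
out_concat = np.concatenate([inputs, row[None, :]], axis=0)

# vstack applies atleast_2d to its inputs, so the 1-d row is accepted as is.
out_vstack = np.vstack((inputs, row))

# np.append with axis=0 also works once the row is 2-d.
out_append = np.append(inputs, np.atleast_2d(row), axis=0)
print(out_concat.shape, out_vstack.shape, out_append.shape)
```

All three return a new array; none of them grows inputs in place.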

Evaluate several elements of numpy object array

I have an ndarray A that stores objects of the same type, in particular various LinearNDInterpolator objects. For example's sake assume it's just 2:
>>> A
array([ <scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe122adc750>,
<scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe11daee590>], dtype=object)
I want to be able to do two things. First, I'd like to evaluate all objects in A at a certain point and get back an ndarray of A.shape with all the values in it. Something like
>> A[[0,1]](1,1) =
array([ 1, 2])
However, I get
TypeError: 'numpy.ndarray' object is not callable
Is it possible to do that?
Second, I would like to change the interpolation values without constructing new LinearNDInterpolator objects (since the nodes stay the same). I.e., something like
A[[0,1]].values = B
where B is an ndarray containing the new values for every element of A.
Thank you for your suggestions.
The same issue, but with simpler functions:
In [221]: A=np.array([add,multiply])
In [222]: A[0](1,2) # individual elements can be called
Out[222]: 3
In [223]: A(1,2) # but not the array as a whole
---------------------------------------------------------------------------
TypeError: 'numpy.ndarray' object is not callable
We can iterate over a list of functions, or that array as well, calling each element on the parameters. Done right we can even zip a list of functions and a list of parameters.
In [224]: ll=[add,multiply]
In [225]: [x(1,2) for x in ll]
Out[225]: [3, 2]
In [226]: [x(1,2) for x in A]
Out[226]: [3, 2]
Another test, the callable function:
In [229]: callable(A)
Out[229]: False
In [230]: callable(A[0])
Out[230]: True
Can you change the interpolation values for individual Interpolators? If so, just iterate through the list and do that.
In general, dtype object arrays function like lists. They contain the same kind of object pointers. Most operations require the same sort of iteration. Unless you need to organize the elements in multiple dimensions, dtype object arrays have few, if any, advantages over lists.
Another thought - the normal array dtype is numeric or fixed length strings. These elements are not callable, so there's no need to implement a .__call__ method on these arrays. They could write something like that to operate on object dtype arrays, but the core action is a Python call. So such a function would just hide the kind of iteration that I outlined.
In another recent question I showed how to use np.char.upper to apply a string method to every element of an S dtype array. But my time tests showed that this did not speed anything up.
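The iteration outlined above can be put together as follows. This sketch uses operator.add and operator.mul as stand-ins for the interpolator objects; params, call_each and the other names are illustrative. The np.frompyfunc wrapper is one way to hide the loop, under the assumption that convenience rather than speed is the goal:

```python
from operator import add, mul

import numpy as np

funcs = np.array([add, mul], dtype=object)

# The array is not callable, but each element is; iterate and call.
results = np.array([f(1, 2) for f in funcs])

# Pair each function with its own arguments using zip.
params = [(1, 2), (3, 4)]
paired = [f(*p) for f, p in zip(funcs, params)]

# frompyfunc hides the same Python-level loop behind a ufunc-like call;
# it returns an object array and is not expected to be faster.
call_each = np.frompyfunc(lambda f, x, y: f(x, y), 3, 1)
via_ufunc = call_each(funcs, 1, 2)
print(results, paired, via_ufunc)
```

Because frompyfunc broadcasts its inputs, the same wrapper would also handle a multi-dimensional object array, returning results in the same shape.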