Irregular Numpy matrix - numpy

In Numpy, it appears that the matrix can simply be a nested list of anything not limited to numbers. For example
import numpy as np
a = [[1,2,5],[3,'r']]
b = np.matrix(a)
generates no complaints.
What is the purpose of this tolerance when list can treat the object that is not a matrix in the strict mathematical sense?

What you've created is an object dtype array:
In [302]: b=np.array([[1,2,5],[3,'r']])
In [303]: b
Out[303]: array([[1, 2, 5], [3, 'r']], dtype=object)
In [304]: b.shape
Out[304]: (2,)
In [305]: b[0]
Out[305]: [1, 2, 5]
In [306]: b[1]=None
In [307]: b
Out[307]: array([[1, 2, 5], None], dtype=object)
The elements of this array are pointers - pointers to objects else where in memory. It has a data buffer just like other arrays. In this case 2 pointers, 2
In [308]: b.__array_interface__
Out[308]:
{'data': (169809984, False),
'descr': [('', '|O')],
'shape': (2,),
'strides': None,
'typestr': '|O',
'version': 3}
In [309]: b.nbytes
Out[309]: 8
In [310]: b.itemsize
Out[310]: 4
It is very much like a list - which also stores object pointers in a buffer. But it differs in that it doesn't have an append method, but does have all the array ones like .reshape.
And for many operations, numpy treats such an array like a list - iterating over the pointers, etc. Many of the math operations that work with numeric values fail with object dtypes.
Why allow this? Partly it's just a generalization, expanding the concept of element values/dtypes beyond the simple numeric and string ones. numpy also allows compound dtypes (structured arrays). MATLAB expanded their matrix class to include cells, which are similar.
I see a lot of questions on SO about object arrays. Sometimes they are produced in error, Creating numpy array from list gives wrong shape.
Sometimes they are created intentionally. pandas readily changes a data series to object dtype to accommodate a mix of values (string, nan, int).
np.array() tries to create as high a dimension array as it can, resorting to object dtype only when it can't, for example when the sublists differ in length. In fact you have to resort to special construction methods to create an object array when the sublists are all the same.
This is still an object array, but the dimension is higher:
In [316]: np.array([[1,2,5],[3,'r',None]])
Out[316]:
array([[1, 2, 5],
[3, 'r', None]], dtype=object)
In [317]: _.shape
Out[317]: (2, 3)

Related

Newshape parameter in numpy.reshape can be a list?

The numpy official document specifies the synopsis of reshape as follows:
numpy.reshape(a, newshape, order='C')
where newshape can be an int or tuple of ints.
The document does not say newshape can be a list, but my testing indicates that newshape can be a list. For example:
a = np.array([[1,2,3],[4,5,6]])
b = a.reshape([3,2])
>>> b
array([[1, 2],
[3, 4],
[5, 6]])
Is the feature of providing newshape as a list a nonstandard extension, so that the document does not mention it?
Tuples and lists have similar properties, and often you can use arraylike objects (or iterables) like lists, tuples, sets, or NumPy arrays interchangeably. They don't write in the documentation about all possibilities. I guess they use tuple in this case because if you call the function shape it returns a tuple of ints.

How to relaibly create a multi-dimensional array and a one-dimensional view of it in numpy, so that the memory layout be contiguous?

According to the documentation of numpy.ravel,
Return a contiguous flattened array.
A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.
For convenience and efficiency of indexing, I would like to have a one-dimensional view of a 2-dimensional array. I am using ravel for creating the view, and so far so good.
However, it is not clear to me what is meant by "A copy is made only if needed." If some day a copy is created while my code is executed, the code will stop working.
I know that there is numpy.reshape, but its documentation says:
It is not always possible to change the shape of an array without copying the data.
In any case, I would like the data to be contiguous.
How can I reliably create at 2-dimensional array and a 1-dimensional view into it? I would like the data to be contiguous in memory (for efficiency). Are there any attributes to specify when creating the 2-dimensional array to assure that it is contiguous and ravel will not need to copy it?
Related question: What is the difference between flatten and ravel functions in numpy?
The warnings for ravel and reshape are the same. ravel is just reshape(-1), to 1d. Conversely reshape docs tells us that we can think of it as first doing a ravel.
Normal array construction produces a contiguous array, and reshape with the same order will produce a view. You can visually test that by looking at the ravel and checking if the values appear in the expected order.
In [348]: x = np.arange(6).reshape(2,3)
In [349]: x
Out[349]:
array([[0, 1, 2],
[3, 4, 5]])
In [350]: x.ravel()
Out[350]: array([0, 1, 2, 3, 4, 5])
I started with the arange, reshaped it to 2d, and back to 1d. No change in order.
But if I make a sliced view:
In [351]: x[:,:2]
Out[351]:
array([[0, 1],
[3, 4]])
In [352]: x[:,:2].ravel()
Out[352]: array([0, 1, 3, 4])
This ravel has a gap, and thus is a copy.
Transpose is also a view, which cannot be reshaped to a view:
In [353]: x.T
Out[353]:
array([[0, 3],
[1, 4],
[2, 5]])
In [354]: x.T.ravel()
Out[354]: array([0, 3, 1, 4, 2, 5])
Except, if we specify the right order, the ravel is a view.
In [355]: x.T.ravel(order='F')
Out[355]: array([0, 1, 2, 3, 4, 5])
reshape has a extensive discussion of order. And transpose actually works by returning a view with different shape and strides. For a 2d array transpose produces a order F array.
So as long as you are aware of manipulations like this, you can safely assume that the reshape/ravel is contiguous.
Note that even though [354] is a copy, assignment to the flat changes the original
In [361]: x[:,:2].flat[:] = [3,4,2,1]
In [362]: x
Out[362]:
array([[3, 4, 2],
[2, 1, 5]])
x[:,:2].ravel()[:] = [10,11,2,3] does not change x. In cases like this y = x[:,:2].flat may be more useful than the ravel equivalent.

Whats the difference between `arr[tuple(seq)]` and `arr[seq]`? Relating to Using a non-tuple sequence for multidimensional indexing is deprecated

I am using an ndarray to slice another ndarray.
Normally I use arr[ind_arr]. numpy seems to not like this and raises a FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated use arr[tuple(seq)] instead of arr[seq].
What's the difference between arr[tuple(seq)] and arr[seq]?
Other questions on StackOverflow seem to be running into this error in scipy and pandas and most people suggest the error to be in the particular version of these packages. I am running into the warning running purely in numpy.
Example posts:
FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated use `arr[tuple(seq)]` instead of `arr[seq]`
FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated use `arr[tuple(seq)]`
FutureWarning with distplot in seaborn
MWE reproducing warning:
import numpy as np
# generate a random 2d array
A = np.random.randint(20, size=(7,7))
print(A, '\n')
# define indices
ind_i = np.array([1, 2, 3]) # along i
ind_j = np.array([5, 6]) # along j
# generate index array using meshgrid
ind_ij = np.meshgrid(ind_i, ind_j, indexing='ij')
B = A[ind_ij]
print(B, '\n')
C = A[tuple(ind_ij)]
print(C, '\n')
# note: both produce the same result
meshgrid returns a list of arrays:
In [50]: np.meshgrid([1,2,3],[4,5],indexing='ij')
Out[50]:
[array([[1, 1],
[2, 2],
[3, 3]]), array([[4, 5],
[4, 5],
[4, 5]])]
In [51]: np.meshgrid([1,2,3],[4,5],indexing='ij',sparse=True)
Out[51]:
[array([[1],
[2],
[3]]), array([[4, 5]])]
ix_ does the same thing, but returns a tuple:
In [52]: np.ix_([1,2,3],[4,5])
Out[52]:
(array([[1],
[2],
[3]]), array([[4, 5]]))
np.ogrid also produces the list.
In [55]: arr = np.arange(24).reshape(4,6)
indexing with the ix tuple:
In [56]: arr[_52]
Out[56]:
array([[10, 11],
[16, 17],
[22, 23]])
indexing with the meshgrid list:
In [57]: arr[_51]
/usr/local/bin/ipython3:1: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
#!/usr/bin/python3
Out[57]:
array([[10, 11],
[16, 17],
[22, 23]])
Often the meshgrid result is used with unpacking:
In [62]: I,J = np.meshgrid([1,2,3],[4,5],indexing='ij',sparse=True)
In [63]: arr[I,J]
Out[63]:
array([[10, 11],
[16, 17],
[22, 23]])
Here [I,J] is the same as [(I,J)], making a tuple of the 2 subarrays.
Basically they are trying to remove a loophole that existed for historical reasons. I don't know if they can change the handling of meshgrid results without causing further compatibility issues.
For Future Readers having this FutureWarning: ... and want to correctly resolve them.
Read the answer of #hpaulj here. And notice the point is when he(she) said:
[...] returns a list of [...]
[...] returns a tuple of [...]
If you don't know why 1. is the point please read the answer: https://stackoverflow.com/a/71487259/5290519, which is also written by him(her). This answer provides:
Why, in the first place, the usage of list as index will cause a warning.
[4] is a problem because in the past, certain lists were interpreted as though they were tuples. This is a legacy case that developers are trying to cleanup, hence the FutureWarning.
A series of concise examples on the (correct) usage of list,tuple,np.array as array index.
If you still don't get the point, try my own answer after I figuring all these complicated concepts: https://stackoverflow.com/a/71493474/5290519

what does the np.array command do?

question about the np.array command.
let's say the content of caches when you displayed it with the print command is
caches = [array([1,2,3]),array([1,2,3]),...,array([1,2,3])]
Then I executed following code:
train_x = np.array(caches)
When I print the content of train_x I have:
train_x = [[1,2,3],[1,2,3],...,[1,2,3]]
Now, the behavior is exactly as I want but do not really understand in dept what the np.array(caches) command has done. Can somebody explain this to me?
Making a 1d array
In [89]: np.array([1,2,3])
Out[89]: array([1, 2, 3])
In [90]: np.array((1,2,3))
Out[90]: array([1, 2, 3])
[1,2,3] is a list; (1,2,3) is a tuple. np.array treats them as the same. (list versus tuple does make a difference when creating structured arrays, but that's a more advanced topic.)
Note the shape is (3,) (shape is a tuple)
Making a 2d array from a nested list - a list of lists:
In [91]: np.array([[1,2],[3,4]])
Out[91]:
array([[1, 2],
[3, 4]])
In [92]: _.shape
Out[92]: (2, 2)
np.array takes data, not shape information. It infers shape from the data.
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
In these examples the object parameter is a list or list of lists. We aren't, at this stage, defining the other parameters.

If I pass a ndarray view to a function I can find its base but how can I find the slice?

numpy slicing e.g. S=np.s_[1:-1]; V=A[1:-1], produces a view of the underlying array. I can find this underlying array by V.base. If I pass such a view to a function, e.g.
def f(x):
return x.base
then f(V) == A. But how can I find the slice information S? I am looking for an attribute something like base containing information on the slice that created this view. I would like to be able to write a function to which I can pass a view of an array and return another view of the same array calculated from the view. E.g. I would like to be able to shift the view to the right or left of a one dimensional array.
As far as I know the slicing information is not stored anywhere, but you might be able to deduce it from attributes of the view and base.
For example:
In [156]: x=np.arange(10)
In [157]: y=x[3:]
In [159]: y.base
Out[159]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [160]: y.data
Out[160]: <memory at 0xb1a16b8c>
In [161]: y.base.data
Out[161]: <memory at 0xb1a16bf4>
I like the __array_interface__ value better:
In [162]: y.__array_interface__['data']
Out[162]: (163056924, False)
In [163]: y.base.__array_interface__['data']
Out[163]: (163056912, False)
So y databuffer starts 12 bytes beyond x. And since y.itemsize is 4, this means that the slicing start is 3.
In [164]: y.shape
Out[164]: (7,)
In [165]: x.shape
Out[165]: (10,)
And comparing the shapes, I deduce that the slice stop is None (the end).
For 2d arrays, or stepped slicing you'd have to look at the strides as well.
But in practice it is probably easier, and safer, to pass the slicing object (tuple, slice, etc) to your function, rather than deduce it from the results.
In [173]: S=np.s_[1:-1]
In [174]: S
Out[174]: slice(1, -1, None)
In [175]: x[S]
Out[175]: array([1, 2, 3, 4, 5, 6, 7, 8])
That is, pass S itself, rather than deduce it. I've never seen it done before.