numpy - why Z[(0,2)] is view but Z[(0, 2), (0)] is copy? - numpy

Question
Why are the numpy tuple indexing behaviors inconsistent? Please explain the rational or design decision behind these behaviors. In my understanding, Z[(0,2)] and Z[(0, 2), (0)] are both tuple indexing and expected the consistent behavior for copy/view. If this is incorrect, please explain,
import numpy as np
Z = np.arange(36).reshape(3, 3, 4)
print("Z is \n{}\n".format(Z))
b = Z[
(0,2) # Select Z[0][2]
]
print("Tuple indexing Z[(0,2)] is \n{}\nIs view? {}\n".format(
b,
b.base is not None
))
c = Z[ # Select Z[0][0][1] & Z[0][2][1]
(0,2),
(0)
]
print("Tuple indexing Z[(0, 2), (0)] is \n{}\nIs view? {}\n".format(
c,
c.base is not None
))
Z is
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]]
Tuple indexing Z[(0,2)] is
[ 8 9 10 11]
Is view? True
Tuple indexing Z[(0, 2), (0)] is
[[ 0 1 2 3]
[24 25 26 27]]
Is view? False
Numpy indexing is confusing and wonder how people built the understanding. If there is a good way to understand or cheat-sheets, please advise.

It's the comma that creates a tuple. The () just set boundaries where needed.
Thus
Z[(0,2)]
Z[0,2]
are the same, select on the first 2 dimension. Whether that returns an element, or an array depends on how many dimensions Z has.
The same interpretation applies to the other case.
Z[(0, 2), (0)]
Z[( np.array([0,2]), 0)]
Z[ np.array([0,2]), 0]
are the same - the first dimensions is indexed with a list/array, and thus is advanced indexing. It's a copy.
[ 8 9 10 11]
is a row of the 3d array; its a contiguous block of Z
[[ 0 1 2 3]
[24 25 26 27]]
is 2 rows from Z. They aren't contiguous, so there's no way of identifying them with just shape and strides (and offset in the databuffer).
details
__array_interface__ gives details about the underlying data of an array
In [146]: Z = np.arange(36).reshape(3,3,4)
In [147]: Z.__array_interface__
Out[147]:
{'data': (38255712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3, 4),
'version': 3}
In [148]: Z.strides
Out[148]: (96, 32, 8)
For the view:
In [149]: Z1 = Z[0,2]
In [150]: Z1
Out[150]: array([ 8, 9, 10, 11])
In [151]: Z1.__array_interface__
Out[151]:
{'data': (38255776, False), # 38255712+8*8
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4,),
'version': 3}
The data buffer pointer is 8 elements further along in Z buffer. Shape is much reduced.
In [152]: Z2 = Z[[0,2],0]
In [153]: Z2
Out[153]:
array([[ 0, 1, 2, 3],
[24, 25, 26, 27]])
In [154]: Z2.__array_interface__
Out[154]:
{'data': (31443104, False), # an entirely different location
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 4),
'version': 3}
Z2 is the same as two selections:
In [158]: Z[0,0]
Out[158]: array([0, 1, 2, 3])
In [159]: Z[2,0]
Out[159]: array([24, 25, 26, 27])
It is not
Z[0][0][1] & Z[0][2][1]
Z[0,0,1] & Z[0,2,1]
Compare that with a 2 row slice:
In [156]: Z3 = Z[0:2,0]
In [157]: Z3.__array_interface__
Out[157]:
{'data': (38255712, False), # same as Z's
'strides': (96, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 4),
'version': 3}
A view is returned if the new array can be described with shape, strides and all or part of the original data buffer.

Related

Given a (nested) view into a numpy 2D array, how to retrive the coords w.r.t. the original array

Consider the following:
A = np.zeros((100,100)) # TODO: populate A
filt = median_filter(A, size=5) # doesn't impact A.shape
view = filt[30:40, 30:40]
subvew = view[0:5, 0:5]
Is it possible to extract from subview the corresponding rectangle within A?
I'd like to do something like:
coords = get_rect(subview)
rect_A = A[coords]
But if I'm constantly having to pass bounding-rects thru the system the code uglifies fast.
numpy must store this information internally, but is it possible to access it?
PS I'm not doing anything fancy like view = A[::2]
PPS From reviewing the excellent answer, it looks like it should be possible to subclass numpy.ndarray, adding a .parent property and a .get_global_rect() method. But it looks like a HARD task.
In [40]: x = np.arange(24).reshape(4,6)
__array_interface__ is a way of viewing everything about a numpy array.
In [41]: x.__array_interface__
Out[41]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4, 6),
'version': 3}
In [42]: x.strides
Out[42]: (48, 8)
For a view:
In [43]: y = x[:3,1:4]
In [44]: y
Out[44]:
array([[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]])
In [45]: y.__array_interface__
Out[45]:
{'data': (43385720, False),
'strides': (48, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3),
'version': 3}
In [46]: y.base
Out[46]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23])
x.base is the same, the original np.arange(24).
The key difference in y is the shape, and data value, which "points" 8 bytes further along.
So while one could, in theory, deduce the indexing used to create y, numpy does not have a function or method to do that for us. Keeping track of your own "coordinates" is the best option.
Another way to put it, y is a numpy.ndarray, just like x. It does not carry any extra information about how it was created. The same applies to z, a view of y.
As for the 1d base
In [48]: x.base.strides
Out[48]: (8,)
In [49]: x.base.shape
Out[49]: (24,)
In [50]: x.base.__array_interface__
Out[50]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (24,),
'version': 3}

how numpy arrays are stored in memory locations?

a=np.array([[1,2],[4,5]])
b=a.T
print(a is b)
print(np.shares_memory(a,b))
for i in a:
for j in I:
print(i,j,id(j))
print('************')
for i in b:
for j in I:
print(i,j,id(j))
The output of the above code is
False
True
[1 2] 1 2027214431408
[1 2] 2 2027214431184
[4 5] 4 2027214431408
[4 5] 5 2027214431184
************
[1 4] 1 2027214431632
[1 4] 4 2027214431184
[2 5] 2 2027214431632
[2 5] 5 2027214431184
My question is why the location of alternate integer objects are the same in the above code. As python initializes different memory locations to each different objects
In [64]: a=np.array([[1,2],[4,5]])
...: b=a.T
The data value from __array_interface__ tells us (in some sense) where the data-buffer of the array is located. a and b has the same value, indicating that they share the buffer.
In [65]: a.__array_interface__
Out[65]:
{'data': (74597280, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 2),
'version': 3}
In [66]: b.__array_interface__
Out[66]:
{'data': (74597280, False),
'strides': (8, 16),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 2),
'version': 3}
b is a view of a, using the same buffer, but with its own shape and strides.
In [67]: a.strides
Out[67]: (16, 8)
In [68]: b.strides
Out[68]: (8, 16)
Asking for the id of an indexed element tells us nothing about that data-buffer. It just identifies the object that was derived from the array - by value, not by reference. list stores elements by reference, arrays do not.
In [70]: type(a[0,1])
Out[70]: numpy.int64

numpy - is there a way to apply sum() for iterative append to an array

For Python iterables, sum() is applicable to append multiple slices from left to right.
import numpy as np
_list = list(range(15))
print("iterables is {}".format(_list))
print(sum(
[ _list[_slice] for _slice in np.s_[1:3, 5:7, 9:11] ],
start=[]
))
---
List is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
[1, 2, 5, 6, 9, 10]
It cannot be simply apply to numpy array.
import numpy as np
_list = np.arange(15)
print("List is {}\n".format(_list))
print(sum(
[ _list[_slice] for _slice in np.s_[1:3, 5:7, 9:11] ],
start=[]
))
---
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-a9d278e659c8> in <module>
3 print("List is {}\n".format(_list))
4
----> 5 print(sum(
6 [ _list[_slice] for _slice in np.s_[1:3, 5:7, 9:11] ],
7 start=[]
ValueError: operands could not be broadcast together with shapes (0,) (2,)
I suppose numpy way is something like below.
import numpy as np
a = np.arange(15).astype(np.int32)
print("array is {}\n".format(a))
print([a[_slice] for _slice in slices])
np.concatenate([a[_slice] for _slice in slices])
---
array is [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[array([1, 2], dtype=int32), array([5, 6], dtype=int32), array([ 9, 10], dtype=int32)]
array([ 1, 2, 5, 6, 9, 10], dtype=int32)
Question
Is there a way to be able to apply sum(). Is there better way other than np.concatenate?
In [38]: np.s_[1:3, 5:7, 9:11]
Out[38]: (slice(1, 3, None), slice(5, 7, None), slice(9, 11, None))
np.r_ can make a composite index - basically a concatenate of aranges:
In [39]: np.r_[1:3, 5:7, 9:11]
Out[39]: array([ 1, 2, 5, 6, 9, 10])
Alternatively, create the slice objects, index and concatenate:
In [40]: x = np.s_[1:3, 5:7, 9:11]
In [41]: y = np.arange(20)
In [42]: np.concatenate([y[s] for s in x])
Out[42]: array([ 1, 2, 5, 6, 9, 10])
When I looked at this in the past, performance is similar.
Ways of creating the indices with list join:
In [46]: list(range(1,3))+list(range(5,7))+list(range(9,11))
Out[46]: [1, 2, 5, 6, 9, 10]
In [50]: sum([list(range(i,j)) for i,j in [(1,3),(5,7),(9,11)]],start=[])
Out[50]: [1, 2, 5, 6, 9, 10]
sum(..., start=[]) is just a list way of concatenating, using the + definition for lists.
In [55]: alist = []
In [56]: for i,j in [(1,3),(5,7),(9,11)]: alist.extend(range(i,j))
In [57]: alist
Out[57]: [1, 2, 5, 6, 9, 10]

Transforming a sequence of integers into the binary representation of that sequence's strides [duplicate]

I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
data[0:5],
data[1:6],
data[2:7]
]
Is there any way to perform above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an n+1 dimensional array?
I thought maybe I can replicate my data and then select a span from each row, but code below throws an IndexError
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]
You can use the indexes to select the rows you want into the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] will return an element of shape (3,5,2,2,2). Converting
# to list happens along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
# [1, 2, 3, 4, 5],
# [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
nd can be done in a similar fashion, for example 2d -> 4d
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9,2,2) # this forces a copy
# array([[[ 0, 1],
# [ 4, 5]],
# [[ 1, 2],
# [ 5, 6]],
# [[ 2, 3],
# [ 6, 7]],
# [[ 4, 5],
# [ 8, 9]],
# [[ 5, 6],
# [ 9, 10]],
# [[ 6, 7],
# [10, 11]],
# [[ 8, 9],
# [12, 13]],
# [[ 9, 10],
# [13, 14]],
# [[10, 11],
# [14, 15]]])
In this post is an approach with strided-indexing scheme using np.lib.stride_tricks.as_strided that basically creates a view into the input array and as such is pretty efficient for creation and being a view occupies nomore memory space.
Also, this works for ndarrays with generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
# Store the shape and strides info
shp = a.shape
s = a.strides
# Compute length of output array along the first axis
nd0 = shp[0]-L+1
# Setup shape and strides for use with np.lib.stride_tricks.as_strided
# and get (n+1) dim output array
shp_in = (nd0,L)+shp[1:]
strd_in = (s[0],) + s
return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to generated by hand in this way. It can be more dynamic with
b = np.arange(5) + np.arange(3)[:, None]
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
indexing_tricks convenience methods to do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange and concatenates. It even lets me expand and concatenate into 2d
In [269]: np.r_['0,2',0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
['b', 'c', 'd', 'e', 'f'],
['c', 'd', 'e', 'f', 'g']],
dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
'f', 'g'],
dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that relative timings are in the same order of magnitude. It may vary for example with the number of slices versus their length. Overall the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use the flat indexing.
No matter which approach you choose, if 2 slices contain same element, it doesn't support mathematical operations correctly unlesss you use ufunc.at which can be more inefficient than loop. For testing:
def as_strides(arr, window_size, stride, writeable=False):
'''Get a strided sub-matrices view of a 4D ndarray.
Args:
arr (ndarray): input array with shape (batch_size, m1, n1, c).
window_size (tuple): with shape (m2, n2).
stride (tuple): stride of windows in (y_stride, x_stride).
writeable (bool): it is recommended to keep it False unless needed
Returns:
subs (view): strided window view, with shape (batch_size, y_nwindows, x_nwindows, m2, n2, c)
See also numpy.lib.stride_tricks.sliding_window_view
'''
batch_size = arr.shape[0]
m1, n1, c = arr.shape[1:]
m2, n2 = window_size
y_stride, x_stride = stride
view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
1 + (n1 - n2) // x_stride, m2, n2, c)
strides = (arr.strides[0], y_stride * arr.strides[1],
x_stride * arr.strides[2]) + arr.strides[1:]
subs = np.lib.stride_tricks.as_strided(arr,
view_shape,
strides=strides,
writeable=writeable)
return subs
import numpy as np
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0,0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
We can use list comprehension for this
data=np.array([1,2,3,4,5,6,7,8,9,10])
data_extractions=[data[b:b+5] for b in [1,2,3,4,5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]

Multidimensional numpy.outer without flatten

x is N by M matrix.
y is 1 by L vector.
I want to return "outer product" between x and y, let's call it z.
z[n,m,l] = x[n,m] * y[l]
I could probably do this using einsum.
np.einsum("ij,k->ijk", x[:, :, k], y[:, k])
or reshape afterwards.
np.outer(x[:, :, k], y).reshape((x.shape[0],x.shape[1],y.shape[0]))
But I'm thinking of doing this in np.outer only or something seems simpler, memory efficient.
Is there a way?
It's one of those numpy "can't know unless you happen to know" bits: np.outer flattens multidimensional inputs while np.multiply.outer doesn't:
m,n,l = 3,4,5
x = np.arange(m*n).reshape(m,n)
y = np.arange(l)
np.multiply.outer(x,y).shape
# (3, 4, 5)
The code for outer is:
multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
As its docs says, it flattens (i.e. ravel). If the arrays are already 1d, that expression could be written as
a[:,None] * b[None,:]
a[:,None] * b # broadcasting auto adds the None to b
We could apply broadcasting rules to your (n,m)*(1,l):
In [2]: x = np.arange(12).reshape(3,4); y = np.array([[1,2]])
In [3]: x.shape, y.shape
Out[3]: ((3, 4), (1, 2))
You want a (n,m,l), which a (n,m,1) * (1,1,l) achieves. We need to add a trailing dimension to x. The extra leading 1 on y is automatic:
In [4]: z = x[...,None]*y
In [5]: z.shape
Out[5]: (3, 4, 2)
In [6]: z
Out[6]:
array([[[ 0, 0],
[ 1, 2],
[ 2, 4],
[ 3, 6]],
[[ 4, 8],
[ 5, 10],
[ 6, 12],
[ 7, 14]],
[[ 8, 16],
[ 9, 18],
[10, 20],
[11, 22]]])
Using einsum:
In [8]: np.einsum('nm,kl->nml', x, y).shape
Out[8]: (3, 4, 2)
The fact that you approved:
In [9]: np.multiply.outer(x,y).shape
Out[9]: (3, 4, 1, 2)
suggests y isn't really (1,l) but rather (l,)`. Adjust for either is easy.
I don't think there's much difference in memory efficiency among these. In this small example In[4] is fastest, but not by much.