How to reconstruct array from np.unique - numpy

From the documentation of np.unique, I know I can get return_index and return_inverse. It looks like COO matrix compression.
I wonder how I can reconstruct the array with numpy. I found that I can do it with np.fromfunction, but then I must write a function that determines every item of the result array. I want to know if there is an easy/neat way.

Not sure if np.fromfunction is needed. The documentation actually gives you the example to reconstruct the array from return_inverse:
https://numpy.org/doc/stable/reference/generated/numpy.unique.html
a = np.array([1, 2, 6, 4, 2, 3, 2])
u, indices = np.unique(a, return_inverse=True)
u # ==> array([1, 2, 3, 4, 6])
indices # ==> array([0, 1, 4, 3, 1, 2, 1])
u[indices] # ==> array([1, 2, 6, 4, 2, 3, 2])
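
As a minimal sketch of how the two optional outputs relate (the variable names first_idx and inverse are only illustrative): return_index points at the first occurrence of each unique value in the original array, while return_inverse is what rebuilds the original array from the unique values.

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])

# first_idx: position of the first occurrence of each unique value in a
# inverse:   for each element of a, its position in u
u, first_idx, inverse = np.unique(a, return_index=True, return_inverse=True)

# Indexing a with first_idx gives back the unique values
from_index = a[first_idx]

# Indexing u with inverse reconstructs the original array
reconstructed = u[inverse]
```

Only return_inverse is needed for the reconstruction; return_index alone does not record where the repeated values were.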

Related

pytorch tensor indices is confusing [duplicate]

I am trying to access a pytorch tensor by a matrix of indices, and I recently found that this bit of code does not work; I cannot find the reason why.
The code below is split into two parts. The first half proves to work, whilst the second trips an error. I fail to see the reason why. Could someone shed some light on this?
import torch
import numpy as np
a = torch.rand(32, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # WORKS for a torch.tensor of size M >= 32. It doesn't work otherwise.
a = torch.rand(16, 16)
m, n = a.shape
xx, yy = np.meshgrid(np.arange(m), np.arange(m))
result = a[xx] # IndexError: too many indices for tensor of dimension 2
and if I change a = np.random.rand(16, 16) it does work as well.
To whoever comes looking for an answer: it looks like it's a bug in PyTorch.
Indexing using numpy arrays is not well defined, and it works only if tensors are indexed using tensors. So, in my example code, this works flawlessly:
a = torch.rand(M, N)
m, n = a.shape
xx, yy = torch.meshgrid(torch.arange(m), torch.arange(m), indexing='xy')
result = a[xx] # WORKS
I made a gist to check it, and it's available here
First, let me give you a quick insight into the idea of indexing a tensor with a numpy array and another tensor.
Example: this is our target tensor to be indexed
numpy_indices = np.array([[0, 1, 2, 7],
                          [0, 1, 2, 3]]) # numpy array
tensor_indices = torch.tensor([[0, 1, 2, 7],
[0, 1, 2, 3]]) # 2D tensor
t = torch.tensor([[1, 2, 3, 4], # targeted tensor
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 24],
[25, 26, 27, 28],
[29, 30, 31, 32]])
numpy_result = t[numpy_indices]
tensor_result = t[tensor_indices]
Indexing using a 2D numpy array: the index is read like pairs (x,y) tensor[row,column] e.g. t[0,0], t[1,1], t[2,2], and t[7,3].
print(numpy_result) # tensor([ 1, 6, 11, 32])
Indexing using a 2D tensor: walks through the index tensor in a row-wise manner and each value is an index of a row in the targeted tensor.
e.g. [[t[0], t[1], t[2], t[7]], [t[0], t[1], t[2], t[3]]]. See the example below: the new shape of tensor_result after indexing is (tensor_indices.shape[0], tensor_indices.shape[1], t.shape[1]) = (2, 4, 4).
print(tensor_result) # tensor([[[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [29, 30, 31, 32]],
# [[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12],
# [ 13, 14, 15, 16]]])
If you try to add a third row to numpy_indices, you will get the same error you have, because each index would then have three components, e.g. (0,0,0)...(7,3,3), while the tensor has only two dimensions.
indices = np.array([[0, 1, 2, 7],
                    [0, 1, 2, 3],
                    [0, 1, 2, 3]])
numpy_result = t[indices] # IndexError: too many indices for tensor of dimension 2
However, this is not the case with indexing by tensor and the shape will be bigger (3,4,4).
Finally, as you see the outputs of the two types of indexing are completely different. To solve your problem, you can use
xx = torch.tensor(xx).long() # convert a numpy array to a tensor
What happens in the case of advanced indexing with more rows (rows of numpy_indices > 3), as in your situation, is still ambiguous and unresolved; you can check 1, 2, 3.
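
The two interpretations described in this answer can be reproduced in plain NumPy, which sidesteps the question of how a particular PyTorch version treats NumPy index arrays. This is only a sketch: t and idx are illustrative stand-ins for the tensor and the index matrix above, and the "pairs" reading is made explicit by converting the index array to a tuple.

```python
import numpy as np

# 8x4 target array holding the values 1..32, like the tensor t above
t = np.arange(1, 33).reshape(8, 4)

idx = np.array([[0, 1, 2, 7],
                [0, 1, 2, 3]])

# Interpretation 1: every entry of idx selects a whole row of t,
# so the result has shape idx.shape + t.shape[1:]
rows = t[idx]

# Interpretation 2: the rows of idx are (row, column) coordinate lists,
# i.e. t[[0, 1, 2, 7], [0, 1, 2, 3]]
pairs = t[tuple(idx)]
```

With a third row in idx, the second form asks for three coordinates from a 2-D array, which is where a "too many indices" error comes from.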

Tensorflow filter operation on dataset with several columns

I want to create a subset of my data by applying tf.data.Dataset filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset = tf.data.Dataset.from_tensor_slices(data)
I want to retrieve a subset of 'dataset' which corresponds to all elements whose first column is equal to 1. So, result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it gives me back x[0].
Can someone help me?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))

Transforming a sequence of integers into the binary representation of that sequence's strides [duplicate]

I'm looking for a way to select multiple slices from a numpy array at once. Say we have a 1D data array and want to extract three portions of it like below:
data_extractions = []
for start_index in range(0, 3):
    data_extractions.append(data[start_index: start_index + 5])
Afterwards data_extractions will be:
data_extractions = [
    data[0:5],
    data[1:6],
    data[2:7]
]
Is there any way to perform above operation without the for loop? Some sort of indexing scheme in numpy that would let me select multiple slices from an array and return them as that many arrays, say in an n+1 dimensional array?
I thought maybe I can replicate my data and then select a span from each row, but code below throws an IndexError
replicated_data = np.vstack([data] * 3)
data_extractions = replicated_data[[range(3)], [slice(0, 5), slice(1, 6), slice(2, 7)]]
You can use the indexes to select the rows you want into the appropriate shape.
For example:
data = np.random.normal(size=(100,2,2,2))
# Creating an array of row-indexes
indexes = np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
# data[indexes] will return an element of shape (3,5,2,2,2). Converting
# to list happens along axis 0
data_extractions = list(data[indexes])
np.all(data_extractions[1] == data[1:6])
True
The final comparison is against the original data.
stride_tricks can do that
a = np.arange(10)
b = np.lib.stride_tricks.as_strided(a, (3, 5), 2 * a.strides)
b
# array([[0, 1, 2, 3, 4],
# [1, 2, 3, 4, 5],
# [2, 3, 4, 5, 6]])
Please note that b references the same memory as a, in fact multiple times (for example b[0, 1] and b[1, 0] are the same memory address). It is therefore safest to make a copy before working with the new structure.
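
A sketch of the same windowing with numpy.lib.stride_tricks.sliding_window_view, assuming NumPy 1.20 or newer is available. It returns a read-only view by default, which guards against accidental writes into the shared memory mentioned above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(10)

# All length-5 windows along the only axis; keep the first three
windows = sliding_window_view(a, window_shape=5)[:3]

# The view is read-only and shares memory with a;
# copy it before modifying anything
safe_copy = windows.copy()
safe_copy[0, 1] = 99  # does not touch a
```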
The n-d case can be handled in a similar fashion, for example 2d -> 4d:
a = np.arange(16).reshape(4, 4)
b = np.lib.stride_tricks.as_strided(a, (3,3,2,2), 2*a.strides)
b.reshape(9,2,2) # this forces a copy
# array([[[ 0, 1],
# [ 4, 5]],
# [[ 1, 2],
# [ 5, 6]],
# [[ 2, 3],
# [ 6, 7]],
# [[ 4, 5],
# [ 8, 9]],
# [[ 5, 6],
# [ 9, 10]],
# [[ 6, 7],
# [10, 11]],
# [[ 8, 9],
# [12, 13]],
# [[ 9, 10],
# [13, 14]],
# [[10, 11],
# [14, 15]]])
This post shows an approach with a strided-indexing scheme using np.lib.stride_tricks.as_strided, which creates a view into the input array. As such it is efficient to create and, being a view, occupies no additional memory.
Also, this works for ndarrays with generic number of dimensions.
Here's the implementation -
def strided_axis0(a, L):
    # Store the shape and strides info
    shp = a.shape
    s = a.strides
    # Compute length of output array along the first axis
    nd0 = shp[0]-L+1
    # Setup shape and strides for use with np.lib.stride_tricks.as_strided
    # and get (n+1) dim output array
    shp_in = (nd0,L)+shp[1:]
    strd_in = (s[0],) + s
    return np.lib.stride_tricks.as_strided(a, shape=shp_in, strides=strd_in)
Sample run for a 4D array case -
In [44]: a = np.random.randint(11,99,(10,4,2,3)) # Array
In [45]: L = 5 # Window length along the first axis
In [46]: out = strided_axis0(a, L)
In [47]: np.allclose(a[0:L], out[0]) # Verify outputs
Out[47]: True
In [48]: np.allclose(a[1:L+1], out[1])
Out[48]: True
In [49]: np.allclose(a[2:L+2], out[2])
Out[49]: True
You can slice your array with a prepared slicing array
a = np.array(list('abcdefg'))
b = np.array([
[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]
])
a[b]
However, b doesn't have to be generated by hand in this way. It can be built dynamically with
b = np.arange(5) + np.arange(3)[:, None]
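
A small sketch of how the broadcast index is built and used; the names starts, offsets and idx are only illustrative. A column of start positions plus a row of offsets broadcasts to one row of indices per window.

```python
import numpy as np

data = np.array(list('abcdefg'))

starts = np.arange(3)    # first index of each window
offsets = np.arange(5)   # positions inside a window

# (3, 1) + (5,) broadcasts to a (3, 5) index array
idx = starts[:, None] + offsets

# Fancy indexing returns a new (3, 5) array, not a view
windows = data[idx]
```

Unlike the as_strided approach, the result is a copy, so it can be modified without affecting data.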
In the general case you have to do some sort of iteration - and concatenation - either when constructing the indexes or when collecting the results. It's only when the slicing pattern is itself regular that you can use a generalized slicing via as_strided.
The accepted answer constructs an indexing array, one row per slice. So that is iterating over the slices, and arange itself is a (fast) iteration. And np.array concatenates them on a new axis (np.stack generalizes this).
In [264]: np.array([np.arange(0,5), np.arange(1,6), np.arange(2,7)])
Out[264]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
The index tricks (np.r_) offer a convenient way to do the same thing:
In [265]: np.r_[0:5, 1:6, 2:7]
Out[265]: array([0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6])
This takes the slicing notation, expands it with arange and concatenates. It even lets me expand and concatenate into 2d
In [269]: np.r_['0,2',0:5, 1:6, 2:7]
Out[269]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6]])
In [270]: data=np.array(list('abcdefghijk'))
In [272]: data[np.r_['0,2',0:5, 1:6, 2:7]]
Out[272]:
array([['a', 'b', 'c', 'd', 'e'],
['b', 'c', 'd', 'e', 'f'],
['c', 'd', 'e', 'f', 'g']],
dtype='<U1')
In [273]: data[np.r_[0:5, 1:6, 2:7]]
Out[273]:
array(['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'f', 'c', 'd', 'e',
'f', 'g'],
dtype='<U1')
Concatenating results after indexing also works.
In [274]: np.stack([data[0:5],data[1:6],data[2:7]])
My memory from other SO questions is that relative timings are in the same order of magnitude. It may vary for example with the number of slices versus their length. Overall the number of values that have to be copied from source to target will be the same.
If the slices vary in length, you'd have to use the flat indexing.
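
A minimal sketch of that flat indexing for slices of different lengths. The slice list is made up for illustration, and np.split is just one way to recover the individual pieces afterwards.

```python
import numpy as np

data = np.array(list('abcdefghijk'))

# Illustrative slices of unequal length
slices = [slice(0, 2), slice(3, 7), slice(5, 6)]

# One flat index array covering all slices in order
flat_idx = np.concatenate([np.arange(s.start, s.stop) for s in slices])
flat = data[flat_idx]

# Split the flat result back into one array per slice
lengths = [s.stop - s.start for s in slices]
pieces = np.split(flat, np.cumsum(lengths)[:-1])
```

The pieces cannot be stacked into a regular 2-D array because their lengths differ, which is why the result stays flat (or a list of arrays).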
No matter which approach you choose, if two slices contain the same element, in-place mathematical operations on the result are not applied correctly unless you use ufunc.at, which can be less efficient than a loop. For testing:
def as_strides(arr, window_size, stride, writeable=False):
    '''Get a strided sub-matrices view of a 4D ndarray.
    Args:
        arr (ndarray): input array with shape (batch_size, m1, n1, c).
        window_size (tuple): with shape (m2, n2).
        stride (tuple): stride of windows in (y_stride, x_stride).
        writeable (bool): it is recommended to keep it False unless needed
    Returns:
        subs (view): strided window view, with shape (batch_size, y_nwindows, x_nwindows, m2, n2, c)
    See also numpy.lib.stride_tricks.sliding_window_view
    '''
    batch_size = arr.shape[0]
    m1, n1, c = arr.shape[1:]
    m2, n2 = window_size
    y_stride, x_stride = stride
    view_shape = (batch_size, 1 + (m1 - m2) // y_stride,
                  1 + (n1 - n2) // x_stride, m2, n2, c)
    strides = (arr.strides[0], y_stride * arr.strides[1],
               x_stride * arr.strides[2]) + arr.strides[1:]
    subs = np.lib.stride_tricks.as_strided(arr,
                                           view_shape,
                                           strides=strides,
                                           writeable=writeable)
    return subs
import numpy as np
np.random.seed(1)
Xs = as_strides(np.random.randn(1, 5, 5, 2), (3, 3), (2, 2), writeable=True)[0]
print('input\n0,0\n', Xs[0, 0])
np.add.at(Xs, np.s_[:], 5)
print('unbuffered sum output\n0,0\n', Xs[0,0])
np.add.at(Xs, np.s_[:], -5)
Xs = Xs + 5
print('normal sum output\n0,0\n', Xs[0, 0])
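
The difference between buffered and unbuffered updates is easier to see on a tiny 1-D case. This sketch uses a made-up index array with a repeated entry: an ordinary in-place add applies the update only once per distinct index, while np.add.at accumulates every occurrence.

```python
import numpy as np

idx = np.array([0, 1, 1, 3, 1])  # index 1 appears three times

# Buffered in-place add: repeated indices are only incremented once
buffered = np.zeros(4, dtype=int)
buffered[idx] += 1

# Unbuffered add: every occurrence of an index is accumulated
unbuffered = np.zeros(4, dtype=int)
np.add.at(unbuffered, idx, 1)
```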
We can use list comprehension for this
data=np.array([1,2,3,4,5,6,7,8,9,10])
data_extractions=[data[b:b+5] for b in [1,2,3,4,5]]
data_extractions
Results
[array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]), array([4, 5, 6, 7, 8]), array([5, 6, 7, 8, 9]), array([ 6, 7, 8, 9, 10])]

Decomposing pairs of run-lengths in NumPy

I am wondering whether the below-described function would be possible to implement quickly and efficiently in NumPy, or whether I would have to resort to Cython.
Let us say I have two vectors representing the runs in compressed run-length encodings. These do not have to be the same length. Furthermore, a run-length can never be zero.
r1 = np.array([1, 1, 3])
r2 = np.array([1, 3, 2])
I want a function that decomposes these and returns the lengths they share and the index in the respective array to the run that was decomposed.
Like
shared_runs, idx1, idx2 = f(r1, r2)
print(shared_runs)
# [1, 1, 2, 1, 1]
print(idx1)
# [0, 1, 2, 2, nan] # nan since the total length of r1 is only 5
print(idx2)
# [0, 1, 1, 2, 2]
idx2 explained:
The first shared run has length 1, the same as the first run of r2. Therefore shared_runs[0] corresponds to r2[0]. shared_runs[1] has the value 1. However, r2[1] has the value 3. Therefore both idx2[1] and idx2[2] point to 1, since shared_runs[1] and shared_runs[2] together decompose the value 3.
Further examples:
r1 = np.array([5])
r2 = np.array([1])
# [1, 4], [0, 0], [0, nan]
r1 = np.array([3, 2, 1])
r2 = np.array([1, 4])
# [1, 2, 2, 1], [0, 0, 1, 2], [0, 1, 1, nan]
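
One possible pure-NumPy sketch of such a function, based on cumulative run ends. The name decompose_runs and the cumsum/searchsorted approach are assumptions, not something from the question. Every boundary of either encoding ends a shared run, and searchsorted then finds which original run each shared run falls into.

```python
import numpy as np

def decompose_runs(r1, r2):
    """Hypothetical helper: split two run-length vectors into shared runs."""
    ends1 = np.cumsum(r1)  # end position of each run in r1
    ends2 = np.cumsum(r2)  # end position of each run in r2

    # Every end position of either encoding closes a shared run
    ends = np.union1d(ends1, ends2)
    shared_runs = np.diff(ends, prepend=0)

    # A shared run ending at position e lies in the first run whose end >= e
    idx1 = np.searchsorted(ends1, ends, side='left').astype(float)
    idx2 = np.searchsorted(ends2, ends, side='left').astype(float)

    # Positions past the end of an encoding have no corresponding run
    idx1[idx1 >= len(r1)] = np.nan
    idx2[idx2 >= len(r2)] = np.nan
    return shared_runs, idx1, idx2

shared_runs, idx1, idx2 = decompose_runs(np.array([1, 1, 3]), np.array([1, 3, 2]))
```

The index arrays are returned as floats only so that nan can mark the missing entries; an integer sentinel such as -1 would avoid the cast.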

Default value when indexing outside of a numpy array, even with non-trivial indexing

Is it possible to look up entries from an nd array without throwing an IndexError?
I'm hoping for something like:
>>> a = np.arange(10) * 2
>>> a[[-4, 2, 8, 12]]
IndexError
>>> wrap(a, default=-1)[[-4, 2, 8, 12]]
[-1, 4, 16, -1]
>>> wrap(a, default=-1)[200]
-1
Or possibly more like get_with_default(a, [-4, 2, 8, 12], default=-1)
Is there some builtin way to do this? Can I ask numpy not to throw the exception and return garbage, which I can then replace with my default value?
np.take with clip mode, sort of does this
In [155]: a
Out[155]: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
In [156]: a.take([-4,2,8,12],mode='raise')
...
IndexError: index 12 is out of bounds for size 10
In [157]: a.take([-4,2,8,12],mode='wrap')
Out[157]: array([12, 4, 16, 4])
In [158]: a.take([-4,2,8,12],mode='clip')
Out[158]: array([ 0, 4, 16, 18])
Except you don't have much control over the return value: here indexing with 12 returns 18, the last value, and -4 is treated as out of bounds in the other direction, returning 0.
One way of adding the defaults is to pad a first
In [174]: a = np.arange(10) * 2
In [175]: ind=np.array([-4,2,8,12])
In [176]: np.pad(a, [1,1], 'constant', constant_values=-1).take(ind+1, mode='clip')
Out[176]: array([-1, 4, 16, -1])
Not exactly pretty, but a start.
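
An alternative sketch that avoids padding, for the 1-D case only. The helper name take_or_default is made up; it masks the out-of-bounds positions, indexes with a safe placeholder, and fills in the default afterwards. Negative indices are treated as out of bounds, matching the behaviour asked for in the question, and a is assumed to be non-empty.

```python
import numpy as np

def take_or_default(a, indices, default):
    """Hypothetical helper: 1-D lookup with a default for out-of-bounds indices."""
    indices = np.asarray(indices)
    in_bounds = (indices >= 0) & (indices < a.shape[0])
    # Replace bad indices with 0 so the lookup itself cannot fail
    safe = np.where(in_bounds, indices, 0)
    return np.where(in_bounds, a[safe], default)

a = np.arange(10) * 2
result = take_or_default(a, [-4, 2, 8, 12], default=-1)
scalar_result = take_or_default(a, 200, default=-1)
```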
This is my first post on any stack exchange site so forgive me for any stylistic errors (hopefully there are only stylistic errors). I am interested in the same feature but could not find anything in numpy better than the np.take mentioned by hpaulj. Still, np.take doesn't do exactly what's needed. Alfe's answer works but would need some elaboration in order to handle n-dimensional inputs. The following is another workaround that generalizes to the n-dimensional case. The basic idea is similar to the one used by Alfe: create a new index with the out-of-bounds indices masked out (in my case) or disguised (in Alfe's case) and use it to index the input array without raising an error.
def take(a,indices,default=0):
    #initialize mask; will broadcast to length of indices[0] in first iteration
    mask = True
    for i,ind in enumerate(indices):
        #each element of the mask is only True if all indices at that position are in bounds
        mask = mask & (0 <= ind) & (ind < a.shape[i])
    #create in_bound indices
    in_bound = [ind[mask] for ind in indices]
    #initialize result with default value
    result = default * np.ones(len(mask),dtype=a.dtype)
    #set elements indexed by in_bound to their appropriate values in a
    result[mask] = a[tuple(in_bound)]
    return result
And here is the output from Eric's sample problem:
>>> a = np.arange(10)*2
>>> indices = (np.array([-4,2,8,12]),)
>>> take(a,indices,default=-1)
array([-1, 4, 16, -1])
You can restrict the range of the indexes to the size of the array you want to index into using np.maximum() and np.minimum().
Example:
I have a heatmap like
h = np.array([[ 2, 3, 1],
[ 3, -1, 5]])
and I have a palette of RGB values I want to use to color the heatmap. The palette only names colors for the values 0..4:
p = np.array([[0, 0, 0], # black
[0, 0, 1], # blue
[1, 0, 1], # purple
[1, 1, 0], # yellow
[1, 1, 1]]) # white
Now I want to color my heatmap using the palette:
p[h]
Currently this leads to an error because of the values -1 and 5 in the heatmap:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for axis 0 with size 5
But I can limit the range of the heatmap:
p[np.maximum(np.minimum(h, 4), 0)]
This works and gives me the result:
array([[[1, 0, 1],
[1, 1, 0],
[0, 0, 1]],
[[1, 1, 0],
[0, 0, 0],
[1, 1, 1]]])
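
The same clamping can also be written with np.clip, which combines the two calls. A short sketch using the heatmap and palette from above:

```python
import numpy as np

h = np.array([[2, 3, 1],
              [3, -1, 5]])
p = np.array([[0, 0, 0],   # black
              [0, 0, 1],   # blue
              [1, 0, 1],   # purple
              [1, 1, 0],   # yellow
              [1, 1, 1]])  # white

# Clamp every heatmap value into the valid palette range 0..len(p)-1
clipped = np.clip(h, 0, len(p) - 1)
colored = p[clipped]

# Same result as the np.maximum/np.minimum version
reference = p[np.maximum(np.minimum(h, 4), 0)]
```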
If you really need to have a special value for the indexes which are out of bound, you could implement your proposed get_with_default() like this:
def get_with_default(values, indexes, default=-1):
    return np.concatenate([[default], values, [default]])[
        np.maximum(np.minimum(indexes, len(values)), -1) + 1]
a = np.arange(10) * 2
get_with_default(a, [-4, 2, 8, 12], default=-1)
Will return:
array([-1, 4, 16, -1])
as wanted.