I have a 2D array and a boolean mask of the same shape. I want to use the mask to coalesce consecutive rows in the 2D array. By coalesce I mean reducing each run of consecutive masked rows to its first row. An example:
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
Expected output:
array([[0, 0],
       [2, 2],
       [3, 3],
       [4, 4]])
And to illustrate how the output might be obtained:
array([[0, 0],                array([[0, 0],                array([[0, 0],
       [1, 1],                       [0, 0],                       [2, 2],
       [2, 2],  -> select ->         [2, 2],  -> reduce ->         [3, 3],
       [3, 3],                       [3, 3],                       [4, 4]])
       [4, 4],                       [4, 4],
       [5, 5]])                      [4, 4]])
What I have tried:
rows[~mask].reshape(-1,2)
But this will only select the rows which should not be reduced.
Upgraded answer
I realized that my initial submission did a lot of unnecessary operations. Given a mask such as
mask = [1,1,0,0,1,1,0,0,1,1,1,0]
you simply want to negate the leading one of each run of ones:
#negate:v       v       v
mask = [0,1,0,0,0,1,0,0,0,1,1,0]
and then negate the whole mask to get the rows you want. This is much more efficient than doing a forward fill on indices and removing repeated indices (see old answer). Revised solution:
import numpy as np
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
def maskforwardfill(a: np.ndarray, mask: np.ndarray):
    mask = mask.copy()
    mask[1:] = mask[1:] & mask[:-1]  # Negate leading True values
    # The first element must always end up False: it is either False already,
    # or it is a leading True value (which should be set to False)
    mask[0] = False
    return a[~mask]  # index out wanted rows
# Reduce mask's dimension since I assume that you only do complete rows
print(maskforwardfill(rows, mask.any(1)))
#[[0 0]
# [2 2]
# [3 3]
# [4 4]]
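To see the trick on the longer mask from the explanation above, here is a small sketch that returns the indices of the kept rows instead of the rows themselves. The helper name kept_row_indices is made up for this illustration and is not part of the answer's code.

```python
import numpy as np

def kept_row_indices(mask_1d):
    """Indices that survive when every run of True is reduced to its first row."""
    mask_1d = np.asarray(mask_1d, dtype=bool)
    # a "follower" is a True entry whose predecessor is also True
    follower = np.zeros_like(mask_1d)
    follower[1:] = mask_1d[1:] & mask_1d[:-1]
    # keep everything except followers
    return np.flatnonzero(~follower)

mask = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
kept = kept_row_indices(mask)
print(kept)  # [ 0  2  3  4  6  7  8 11]
```

Rows 1, 5, 9 and 10 are dropped because each of them continues a run of ones started by the row before it.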
Old answer
Here I assume that you only need complete rows (like in @Arne's answer). My idea is that given the mask and the corresponding array indices
mask = [1,1,0,0,1,1]
indices = [0,1,2,3,4,5]
you can use np.diff to first obtain
indices = [0,-1,2,3,4,-1]
Then a forward fill (where -1 acts as nan) on the indices such that you get
[0,0,2,3,4,4]
on which you can use np.unique to remove repeated indices:
[0,2,3,4] # The rows indices you want
Code:
import numpy as np
rows = np.r_['1,2,0', :6, :6]
mask = np.tile([1, 1, 0, 0, 1, 1], (2,1)).T.astype(bool)
def maskforwardfill(a: np.ndarray, mask: np.ndarray):
    mask = mask.copy()
    indices = np.arange(len(a))
    mask[np.diff(mask, prepend=[0]) == 1] = False  # set leading True to False
    indices[mask] = -1
    indices = np.maximum.accumulate(indices)  # forward fill indices
    indices = np.unique(indices)  # remove repeats
    return a[indices]  # index out wanted rows
# Reduce mask's dimension since I assume that you only do complete rows
print(maskforwardfill(rows, mask.any(1)))
#[[0 0]
# [2 2]
# [3 3]
# [4 4]]
Assuming it's always about complete rows, you can reduce the mask to one dimension. Then a straightforward approach is to iterate over the rows:
# reduce mask to one dimension for row selection
mask_1d = mask.any(axis=1)
# replace rows with previous ones based on mask
for i in range(1, len(rows)):
    if mask_1d[i-1] and mask_1d[i]:
        rows[i] = rows[i-1]
# leave out repeated rows
reduced = [rows[0]]
for i in range(1, len(rows)):
    if not (rows[i] == rows[i-1]).all():
        reduced.append(rows[i])
reduced = np.array(reduced)
reduced
array([[0, 0],
       [2, 2],
       [3, 3],
       [4, 4]])
Related
I have two 2D NumPy arrays, A and B.
I want to remove all the rows in A which appear in B.
I tried something like this:
A[~np.isin(A, B)]
but isin keeps the dimensions of A, whereas I need one boolean value per row to filter it.
EDIT: something like this
A = np.array([[3, 0, 4],
              [3, 1, 1],
              [0, 5, 9]])
B = np.array([[1, 1, 1],
              [3, 1, 1]])
.....
A = np.array([[3, 0, 4],
              [0, 5, 9]])
Probably not the most performant solution, but it does exactly what you want: change the dtype of A and B so that each element is a unit consisting of one whole row. You need to ensure that the arrays are contiguous first, e.g. with np.ascontiguousarray:
Av = np.ascontiguousarray(A).view(np.dtype([('', A.dtype, A.shape[1])])).ravel()
Bv = np.ascontiguousarray(B).view(Av.dtype).ravel()
Now you can apply np.isin directly:
>>> np.isin(Av, Bv)
array([False, True, False])
According to the docs, invert=True is faster than negating the output of isin, so you can do
A[np.isin(Av, Bv, invert=True)]
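Try the following. It uses matrix multiplication to reduce each row to a single number, so that np.isin can compare rows as scalars. Note that this reduction is not guaranteed to be unique: two different rows can give the same dot product (with the weights [4, 6, 10] computed here, [3, 0, 0] and [0, 2, 0] both give 12), so it can report false matches on other data: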
import numpy as np
A = np.array([[3, 0, 4],
              [3, 1, 1],
              [0, 5, 9]])
B = np.array([[1, 1, 1],
              [3, 1, 1]])
arr_max = np.maximum(A.max(0) + 1, B.max(0) + 1)
print(A[~np.isin(A.dot(arr_max), B.dot(arr_max))])
Output:
[[3 0 4]
 [0 5 9]]
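For non-negative integer rows, the reduction to one number per row can be made unique by treating each row as a mixed-radix number, which is what np.ravel_multi_index computes. The following is only a sketch under that assumption (non-negative integers, and the product of the per-column ranges small enough to fit in an integer index); the helper name remove_rows is made up for illustration.

```python
import numpy as np

def remove_rows(A, B):
    """Drop rows of A that also occur in B (non-negative integer arrays only)."""
    # one radix per column, large enough for every value in A and B
    dims = tuple(int(m) + 1 for m in np.maximum(A.max(0), B.max(0)))
    # encode each row as one mixed-radix integer; distinct rows get distinct codes
    codes_A = np.ravel_multi_index(tuple(A.T), dims)
    codes_B = np.ravel_multi_index(tuple(B.T), dims)
    return A[~np.isin(codes_A, codes_B)]

A = np.array([[3, 0, 4],
              [3, 1, 1],
              [0, 5, 9]])
B = np.array([[1, 1, 1],
              [3, 1, 1]])
result = remove_rows(A, B)
print(result)
# [[3 0 4]
#  [0 5 9]]
```

Unlike a plain dot product with per-column weights, this encoding keeps rows such as [3, 0, 0] and [0, 2, 0] apart.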
This is certainly not the most performant solution but it is relatively easy to read:
A = np.array([row for row in A if row not in B])
Edit:
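I found that the code above does not work correctly, because row not in B is evaluated as not (B == row).any(), which already counts a single matching element as a hit. Comparing whole rows does work (the result is a list of rows; wrap it in np.array(...) to get an array back):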
A = [row for row in A if not any(np.equal(B, row).all(1))]
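The same whole-row comparison can also be written without a Python loop by broadcasting A against B. This is a minimal sketch, not taken from the answers above; it builds an intermediate boolean array of shape (len(A), len(B), ncols), so it is meant for inputs where that fits in memory. The names eq and in_B are only illustrative.

```python
import numpy as np

A = np.array([[3, 0, 4],
              [3, 1, 1],
              [0, 5, 9]])
B = np.array([[1, 1, 1],
              [3, 1, 1]])

# eq[i, j] is True when row i of A equals row j of B in every column
eq = (A[:, None, :] == B[None, :, :]).all(axis=-1)
# one boolean per row of A: does that row appear anywhere in B?
in_B = eq.any(axis=1)
result = A[~in_B]
print(in_B)    # [False  True False]
print(result)
# [[3 0 4]
#  [0 5 9]]
```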
Could someone explain to me why the second assertion below fails? I do not understand why using a slice or a range for indexing would make a difference in this case.
import numpy as np
d = np.zeros(shape = (1,2,3))
assert d[:, 0, slice(0,2)].shape == d[:, 0, range(0,2)].shape #This doesn't trigger an exception as both operands return (1,2)
assert d[0, :, slice(0,2)].shape == d[0, :, range(0,2)].shape #This does because (1,2) != (2,1)...
Make the array more diagnostic:
In [66]: d = np.arange(6).reshape(1,2,3)
In [67]: d
Out[67]:
array([[[0, 1, 2],
        [3, 4, 5]]])
scalar index in the middle:
In [68]: d[:,0,:2]
Out[68]: array([[0, 1]])
In [69]: d[:,0,range(2)]
Out[69]: array([[0, 1]])
Shape is (1,2) for both, though the 2nd is a copy because of the advanced indexing of the last dimension.
Shape is the same in the 2nd set, but the order actually differs:
In [70]: d[0,:,:2]
Out[70]:
array([[0, 1],
       [3, 4]])
In [71]: d[0,:,range(2)]
Out[71]:
array([[0, 3],
       [1, 4]])
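[71] is a case of mixed basic and advanced indexing, which is documented as giving a possibly unexpected result. The scalar 0 and range(2) both count as advanced indices here, and because a slice separates them, the dimension they produce is placed first and the middle sliced dimension is put last.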
https://numpy.org/doc/stable/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
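The reordering is easier to see when the sliced dimension and the indexed dimension have different lengths. A small sketch with an array of shape (2, 3, 4), chosen here purely for illustration:

```python
import numpy as np

d = np.arange(24).reshape(2, 3, 4)

basic = d[0, :, :2]        # scalar plus slices: basic indexing only
mixed = d[0, :, [0, 1]]    # scalar and list are both advanced, split by a slice
print(basic.shape)         # (3, 2)
print(mixed.shape)         # (2, 3): the advanced dimension comes first

# Same values, transposed layout
print(np.array_equal(mixed, basic.T))   # True

# Indexing in two steps keeps the sliced dimension in place
two_step = d[0][:, [0, 1]]
print(two_step.shape)      # (3, 2)
```

Splitting the index into two steps, as in the last line, is one way to avoid the surprise when the slice-like layout is what you want.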
I have the following for loop segment in my code. This nested loop is slowing down my whole execution.
for q in range(batchSize):
    temp = torch.where((composition_matrix == pred[q]).all(dim=1))[0]
    if len(temp) == 0:
        output[q] = 0
    else:
        output[q] = int(temp[0])
Here, composition_matrix is a [14000,2] dimensional PyTorch tensor with only positive integers as cell values. pred and output are both [batchSize,2] dimensional torch tensors.
This for loop is slowing my code down a lot, and I have been unable to find an equivalent broadcasting solution for this code segment.
Does a broadcasting solution exist to eliminate this for loop?
I shall be grateful for any help.
A minimal reproducible example is
import torch
composition_matrix=torch.randint(3, 10, (14000,2))
batchSize=64
pred=torch.randint(3, 10, (batchSize,2))
output=torch.zeros([batchSize])
for q in range(batchSize):
    temp = torch.where((composition_matrix == pred[q]).all(dim=1))[0]
    if len(temp) == 0:
        output[q] = 0
    else:
        output[q] = int(temp[0])
To make it simple, you first need to understand what the operation is essentially doing. You've got two tensors. Tensor A is of shape (14000, 2) and tensor B is of shape (64, 2). The operation you want to do is:
For each row B[i] in B, compare that B[i] (of shape (2,)) with A (of
shape (14000, 2)). If B[i] occurs within A, set output[i] = index of
first occurrence.
This can actually be done in two lines of code (maybe even one line):
comp = (composition_matrix[:, None, :] == pred).all(dim=-1)
output = torch.argmax(comp.float(), axis=0)
The first line creates comp, the broadcasted comparison of composition_matrix and pred, a boolean tensor of shape (14000, 64).
The second line needs to find the "index of the first match". This can be done quite simply with argmax: it will return the index of the first "1" (or if all the values are "0", will return the first index, ie, 0).
(Note that torch does not support argmax for "bool" tensors, and so comp needed to be cast to another data type.)
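Both the loop and the argmax version write 0 when a row of pred does not occur at all, which cannot be told apart from a match at row 0. If that distinction matters, one option is to combine the argmax with an any over the same comparison. This is only a sketch: the function name first_match_index and the missing parameter are made up for illustration, and the tiny tensors stand in for the real data.

```python
import torch

def first_match_index(composition_matrix, pred, missing=-1):
    # comp[i, j] is True when composition_matrix[i] equals pred[j]
    comp = (composition_matrix[:, None, :] == pred[None, :, :]).all(dim=-1)
    # index of the first True in each column (0 if the column has no True)
    first = torch.argmax(comp.to(torch.int64), dim=0)
    # columns without any True get the sentinel instead of 0
    found = comp.any(dim=0)
    return torch.where(found, first, torch.full_like(first, missing))

composition_matrix = torch.tensor([[6, 8], [4, 3], [6, 9], [4, 3]])
pred = torch.tensor([[4, 3], [6, 8], [7, 7]])
result = first_match_index(composition_matrix, pred)
print(result)  # tensor([ 1,  0, -1])
```

Passing missing=0 should reproduce the behaviour of the original loop.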
Sorry for the short and probably over-simplified example. I fear a bigger one would be much more difficult to visualize. But I hope this suits your purpose.
My solution may seem a little complicated but it's fully vectorized and includes no explicit loops.
Here's what I would do:
import torch
torch.manual_seed(0)
batchSize = 8
pred = torch.randint(0, 10, (batchSize, 2))
output = torch.zeros((batchSize, 2))
composition_matrix = torch.randint(0, 10, (14, 2))
# compare all vectors in composition_matrix to all vectors in pred
comparisons = (composition_matrix.unsqueeze(0) == pred.unsqueeze(1))
comparisons = comparisons.all(2)
# form an index array the shape of the comparisons array
comparison_idxs = torch.arange(comparisons.shape[1])
comparison_idxs = comparison_idxs.repeat(batchSize).reshape(*comparisons.shape)
# multiply the comparisons array by the index array
where_result = (comparison_idxs*comparisons)
# replace invalid zeros with the maximal value in each sample
batch_idxs = torch.arange(comparisons.shape[0])
batch_idxs = batch_idxs.repeat(comparisons.shape[1])
batch_idxs = batch_idxs.reshape(comparisons.shape[1], comparisons.shape[0]).T
maxima = where_result.max(1).values[batch_idxs]
maxima_vecor = maxima[(1-comparisons.int()).bool()]
where_result[(1-comparisons.int()).bool()] = maxima_vecor
vectorized_output = where_result.min(1)[0]
output = torch.zeros([batchSize])
for q in range(batchSize):
    temp = torch.where((composition_matrix == pred[q]).all(dim=1))[0]
    if len(temp) == 0:
        output[q] = 0
    else:
        output[q] = int(temp[0])
output:
composition_matrix =
tensor([[6, 8],
        [4, 3],
        [6, 9],
        [1, 4],
        [4, 1],
        [9, 9],
        [9, 0],
        [1, 2],
        [3, 0],
        [5, 5],
        [2, 9],
        [1, 8],
        [8, 3],
        [6, 9]])
pred =
tensor([[4, 9],
        [3, 0],
        [3, 9],
        [7, 3],
        [7, 3],
        [1, 6],
        [6, 9],
        [8, 6]])
output =
tensor([0., 8., 0., 0., 0., 0., 2., 0.])
vectorized_output =
tensor([0, 8, 0, 0, 0, 0, 2, 0])
Some timing results:
torch.manual_seed(0)
batchSize = 8
pred = torch.randint(0, 10, (batchSize, 2))
composition_matrix = torch.randint(0, 10, (14000, 2))
print('timing the vectorized_solution:')
%timeit -n 1000 vectorized_solution(composition_matrix, pred,)
print('timing the loop_solution:')
%timeit -n 1000 loop_solution(composition_matrix, pred,)
output:
timing the vectorized_solution:
1000 loops, best of 5: 137 µs per loop
timing the loop_solution:
1000 loops, best of 5: 1.89 ms per loop
val = np.array([[1, 3], [2, 5], [0, 6], [1, 2] ])
print(np.max(val))
6
I also want to print the row that contains it, [0, 6]. With the axis argument, max returns values from the other rows as well, and argmax doesn't return the row index.
One way is to use np.where, which returns the indices where the condition is true:
r,_ = np.where(val == np.max(val))
val[r]
Output:
array([[0, 6]])
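Another option, sketched here as an alternative to np.where: np.argmax without an axis returns a position in the flattened array, and np.unravel_index turns that position back into a (row, column) pair. Note that this finds only the first occurrence of the maximum, whereas the np.where version returns every row that contains it.

```python
import numpy as np

val = np.array([[1, 3], [2, 5], [0, 6], [1, 2]])

flat_index = np.argmax(val)                         # position in the flattened array
row, col = np.unravel_index(flat_index, val.shape)  # back to 2D coordinates
print(row, col)    # 2 1
print(val[row])    # [0 6]
```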
I have a NumPy array A of shape (s1, ..., sm) with integer entries and a dictionary D with integers as keys and NumPy arrays of shape (t,) as values. I would like to evaluate the dictionary on every entry of the array A to get a new array B of shape (s1, ..., sm, t).
For example
D={1:[0,1],2:[1,0]}
A=np.array([1,2,1])
The output should be
array([[0,1],[1,0],[0,1]])
Motivation: I have an array with indexes of unit vectors as entries and I need to transform it into an array with the vectors as entries.
If you can rename your keys to be 0-indexed, you might use direct array querying on your unit vectors:
>>> units = np.array([D[1], D[2]])
>>> B = units[A - 1] # -1 because 0 indexed: 1 -> 0, 2 -> 1
>>> B
array([[0, 1],
       [1, 0],
       [0, 1]])
And similarly for any shape:
>>> A = np.random.randint(0, 2, (10, 11, 12))  # random keys in {0, 1}; random_integers is deprecated
>>> A.shape
(10, 11, 12)
>>> B = units[A]
>>> B.shape
(10, 11, 12, 2)
You can learn more about advanced indexing in the NumPy documentation.
>>> np.asarray([D[key] for key in A])
array([[0, 1],
       [1, 0],
       [0, 1]])
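Here's an approach using np.searchsorted to find, for each entry of A, the position of its key, and then simply indexing the dictionary's values with those positions to get the desired output. It assumes that the keys are in sorted order and that every entry of A is a key of D. On Python 3, D.keys() and D.values() are views, so they are wrapped in list() first, like so -
idx = np.searchsorted(list(D.keys()),A)
out = np.asarray(list(D.values()))[idx]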
Sample run -
In [45]: A
Out[45]: array([1, 2, 1])
In [46]: D
Out[46]: {1: [0, 1], 2: [1, 0]}
In [47]: idx = np.searchsorted(list(D.keys()),A)
    ...: out = np.asarray(list(D.values()))[idx]
    ...:
In [48]: out
Out[48]:
array([[0, 1],
       [1, 0],
       [0, 1]])
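If the dictionary keys are not guaranteed to be in sorted order, the sorter argument of np.searchsorted can handle that, and the same lookup works for A of any shape. A minimal sketch, assuming every entry of A is a key of D and all values have the same length; the function name lookup is made up for illustration.

```python
import numpy as np

def lookup(D, A):
    """Map every entry of integer array A through dict D (values of equal length)."""
    keys = np.array(list(D.keys()))
    values = np.array(list(D.values()))
    order = np.argsort(keys)                        # permutation that sorts the keys
    pos = np.searchsorted(keys, A, sorter=order)    # position in the sorted keys
    return values[order[pos]]                       # back to the original key order

D = {2: [1, 0], 1: [0, 1]}          # keys deliberately unsorted
A = np.array([[1, 2, 1],
              [2, 2, 1]])
B = lookup(D, A)
print(B.shape)   # (2, 3, 2)
print(B[0])
# [[0 1]
#  [1 0]
#  [0 1]]
```

Entries of A that are not keys of D are not detected by this sketch; they would map to a neighbouring key or raise an IndexError.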