advanced numpy boolean indexing - numpy

I have multiple copies of boolean indexing, and I'm wondering if there is any vectorized way to do the index.
The condition boolean array has shape (3, 2) and both true value and false value have the shape (2, 4)
the desired output shape is (3, 2, 4)
A concrete example:
condition = np.array([[ True, False],
[ False, True],
[ True, False]])
true_value = np.array([[-0.32313401, -1.18761309, 0.4641033 , -0.05341635],
[-0.34072785, 0.45333183, 0.06974008, -1.4338561 ]])
false_value = np.array([[-0.0962484 , 0.5257979 , 1.22036481, 1.41949077],
[ 1.11138278, 0.56253736, 1.57296682, -0.12774857]])
the expected output:
np.array([[[-0.32313401, -1.18761309, 0.4641033 , -0.05341635],
[ 1.11138278, 0.56253736, 1.57296682, -0.12774857]],
[[-0.0962484 , 0.5257979 , 1.22036481, 1.41949077],
[-0.34072785, 0.45333183, 0.06974008, -1.4338561 ]],
[[-0.32313401, -1.18761309, 0.4641033 , -0.05341635],
[ 1.11138278, 0.56253736, 1.57296682, -0.12774857]],
])
was expected to have some where where expected = np.where(condition, true_value, false_value) but it didn't work, right now only value was broadcast to conditions but not the otherway around.

You were very close. In order for np.where to return a 3×2×4 array you need to add a dimension to condition:
expected = np.where(condition[:, :, None], true_value, false_value)

I find a solution via the broadcast product:
expected = condtion[:,:,np.newaxis]*true_value \
+ (~condtion[:,:,np.newaxis])*false_value

Related

Combining two `numpy` arrays by using one as index for columns

If I have two numpy arrays
arr1
Out [7]: array([1, 0, 1, ..., 1, 0, 0])
and
arr2
Out [6]:
array([[0.10420547, 0.8957946 ],
[0.6609819 , 0.3390181 ],
[0.16680466, 0.8331954 ],
...,
[0.27138624, 0.7286138 ],
[0.6883444 , 0.31165552],
[0.70164204, 0.298358 ]], dtype=float32)
what is the quickest way to return a new array arr3 in such a way that arr1 indicates the column that I want from arr2 for each row? I would like to return something like:
arr3
array([0.8957946, 0.6609819, 0.8331954, ... ])
I would do it by filling a new empty array and iterating but I can't think of a quicker way right now.
EDIT:
Ok, a way that I found is the following, but probably not optimal (?):
arr3 = np.array([arr2[i][arr1[i]] for i in range(len(arr2))])
returns
arr3
Out [23]:
array([0.8957946 , 0.6609819 , 0.8331954 , ..., 0.7286138 , 0.6883444 ,
0.70164204], dtype=float32)
You can do it like this:
np.take_along_axis(arr2,arr1[:,None],1).squeeze()

How to construct an equivalent multivariate normal distribution in tensorflow-probability, using TransformedDistribution?

How to construct an equivalent multivariate normal distribution in tensorflow-probability, using TransformedDistribution and tfb.ScaleMatvecLinearOperator?
I'm reading about a tutorial on a bijector in tensorflow_probability: tfp.bijectors.ScaleMatvecLinearOperator.
An example was provided.
n = 10000
loc = 0
scale = 0.5
normal = tfd.Normal(loc=loc, scale=scale)
The above codes creates a univariate normal distribution.
tril = tf.random.normal((2, 4, 4))
scale_low_tri = tf.linalg.LinearOperatorLowerTriangular(tril)
scale_low_tri.to_dense()
The above codes created a tensor consisting of 2 lower triangular matrix:
<tf.Tensor: shape=(2, 4, 4), dtype=float32, numpy=
array([[[-0.56953585, 0. , 0. , 0. ],
[ 1.1368589 , 0.32028311, 0. , 0. ],
[-0.8328388 , -1.9963025 , -0.6005632 , 0. ],
[ 0.596155 , -0.214932 , 1.0988408 , -0.41731614]],
[[ 2.0778096 , 0. , 0. , 0. ],
[-1.1863967 , 2.4897904 , 0. , 0. ],
[ 0.38001925, 1.4962028 , 1.7609248 , 0. ],
[ 2.9253726 , 0.7047957 , 0.050508 , 0.58643174]]],
dtype=float32)>
Then a matrix-vector multiplication bijector is created:
scale_lin_op = tfb.ScaleMatvecLinearOperator(scale_low_tri)
After that, a TransformedDistribution is constructed as follows:
mvn = tfd.TransformedDistribution(normal, scale_lin_op, batch_shape=[2], event_shape=[4]) #
This should have worked in the old versions of tensorflow_probability. However the constructor of TransformedDistribution is changed now and does not accept the last two parameters batch_shape and event_shape. Therefore I tried to use the following way to do the same:
mvn2 = tfd.TransformedDistribution(
distribution=tfd.Sample(
normal,
sample_shape=[4] # base_dist.event_shape == [4]
),
bijector=scale_lin_op, ) # batch_shape=[2], event_shape=[4]
mvn2
And the result seems to have the correct batch_shape and event_shape
<tfp.distributions.TransformedDistribution 'scale_matvec_linear_operatorSampleNormal' batch_shape=[2] event_shape=[4] dtype=float32>
Then, another distribution for comparison is created:
mvn3 = tfd.MultivariateNormalLinearOperator(loc=loc, scale=scale_low_tri)
mvn3
According to the tutorial, the TransformedDistribution mvn2 should be equivalent to the MultivariateNormalLinearOperator mvn3.
# Check
xn = normal.sample((n, 2, 4)) # sample_shape = (n, 2, 4)
tf.norm(mvn2.log_prob(xn) - mvn3.log_prob(xn)) / tf.norm(mvn2.log_prob(xn))
<tf.Tensor: shape=(), dtype=float32, numpy=0.7498207>
But in my result they are not equivalent. (If they are, the above tensor should be 0)
What have I done wrong?

Compare numpy arrays of different shapes

I have two numpy arrays of shapes (4,4) and (9,4)
matrix1 = array([[ 72. , 72. , 72. , 72. ],
[ 72.00396729, 72.00396729, 72.00396729, 72.00396729],
[596.29998779, 596.29998779, 596.29998779, 596.29998779],
[708.83398438, 708.83398438, 708.83398438, 708.83398438]])
matrix2 = array([[ 72.02400208, 77.68997192, 115.6057663 , 105.64997101],
[120.98195648, 77.68997192, 247.19802856, 105.64997101],
[252.6330719 , 77.68997192, 337.25634766, 105.64997101],
[342.63256836, 77.68997192, 365.60125732, 105.64997101],
[ 72.02400208, 113.53997803, 189.65515137, 149.53997803],
[196.87202454, 113.53997803, 308.13119507, 149.53997803],
[315.3480835 , 113.53997803, 405.77023315, 149.53997803],
[412.86999512, 113.53997803, 482.0453186 , 149.53997803],
[ 72.02400208, 155.81002808, 108.98254395, 183.77003479]])
I need to compare all the rows of matrix2 with every row of matrix1. How can this be done without looping in the elements of matrix1?
If it is about element-wise comparison of the rows, then check this example:
# Generate sample arrays
a = np.random.randint(0, 5, size = (4, 3))
b = np.random.randint(-1, 6, size = (5, 3))
# Compare
a == b[:, None]
The last line does the comparison for you. The output array will have shape (num_of_b_rows, num_of_a_rows, common_num_of_cols): in this case, (5, 4, 3).

Udacity Deep Learning: Assignment 1, Part 5

I'm working on the Udacity Deep Learning class and I'm working on the first assignment, problem 5 where you try to count the number of duplicates in, say, your test set and training set. (Or validation and training, etc.)
I've looked at other people's answers, but I'm not satisfied with them for various reasons. For example, I tried out someone's hash based solution. But I felt the results returned was not likely to be correct.
So the main idea is that you have an array of images that are formatted as arrays. I.e. you're trying to compare two 3-dimensional arrays on index 0. One array is the training dataset, which is 200000 rows with each row containing a 2-D array that is the values for the image. The other is the test set, with is 10000 rows with each row containing a 2-D array of an image. The goal is to find all rows in the test set that match (for now, exactly match is fine) a row in the training set. Since each 'row' is itself an image (which is a 2-d array) then to make this work fast I must be able to do a comparison of both sets as an element-wise compare of each row.
I worked up my own fairly simple solution like this:
# Find duplicates
# Loop through validation/test set and find ones that are identical matrices to something in the training data
def find_duplicates(compare_set, compare_labels, training_set, training_labels):
dup_count = 0
duplicates = []
for i in range(len(compare_set)):
if i > 100: continue
if i % 100 == 0:
print("i: ", i)
for j in range(len(training_set)):
if compare_labels[i] == training_labels[j]:
if np.array_equal(compare_set[i], training_set[j]):
duplicates.append((i,j))
dup_count += 1
return dup_count, duplicates
#print(len(valid_dataset))
print(len(train_dataset))
valid_dup_count, duplicates = find_duplicates(valid_dataset, valid_labels, train_dataset, train_labels)
print(valid_dup_count)
print(duplicates)
#test_dups = find_duplicates(test_dataset, train_dataset)
#print(test_dups)
The reason it just "continues" after 100 is because that alone takes a very long time. If I were to try to compare all 10,000 rows of the validation set to the training set, it would take forever.
I like my solution in principle because it allows me to not only count the duplicates, but get a list back of which matches existed. (Something missing on every other solution I've looked at.) This allows me to manually test that I'm getting the right solution.
What I really need is a much faster (i.e. built into Numpy) solution to compare matrices of matrices like this. I've played with 'isin' and 'where' but haven't figured out how to use those to get the results I'm after. Can someone point me in the right direction for a faster solution?
You should be able to compare a single image from compare_set throughout all the images in training_set with a single line of code using np.all(). You can provide multiple axes as a tuple in the axis argument to check array equality over rows and columns, going through each of the images. Then np.where() can give you the indices you want.
For example:
n_train = 50
n_validation = 10
h, w = 28, 28
training_set = np.random.rand(n_train, h, w)
validation_set = np.random.rand(n_validation, h, w)
# create some duplicates
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]
duplicates = []
for i, img in enumerate(validation_set):
training_dups = np.where(np.all(training_set == img, axis=(1, 2)))[0]
for j in training_dups:
duplicates.append((i, j))
print(duplicates)
[(2, 5), (2, 10), (8, 5), (8, 10)]
Many numpy functions, np.all() included, let you specify the axes to operate on. For example, let's say you had the two arrays
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[1, 2], [5, 6]])
>>> A
array([[1, 2],
[3, 4]])
>>> B
array([[1, 2],
[5, 6]])
Now, A and B have the same first row, but a different second row. If we check equality for them
>>> A == B
array([[ True, True],
[False, False]], dtype=bool)
We get an array the same shape as A and B. But what if I want the indices of the rows which are equal? Well in this case what we can do is say 'only return True if all the values in the row (i.e. the value in each column) are True'. So we can use np.all() after the equality check, and provide it the axis corresponding to the columns.
>>> np.all(A == B, axis=1)
array([ True, False], dtype=bool)
So this result is letting us know that the first row is equal in both arrays, and the second row is not all equal. We can then get the row indices with np.where()
>>> np.where(np.all(A == B, axis=1))
(array([0]),)
So here we see row 0, i.e. A[0] and B[0] are equal.
Now in the solution I proposed, you have a 3D array instead of these 2D arrays. We don't care if a single row is equal, we care if all the rows and columns are equal. So breaking it down as above, let's create two random 5x5 images. I'll grab one of those images and check for equality among the array of two images:
>>> imgs = np.random.rand(2, 5, 5)
>>> img = imgs[1]
>>> imgs == img
array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]],
[[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]]], dtype=bool)
So this is obvious that the second one is correct, but I want to reduce all those True values to one True value; I only want the index corresponding to images where every value is equal.
If we use axis=1
>>> np.all(imgs == img, axis=1)
array([[False, False, False, False, False],
[ True, True, True, True, True]], dtype=bool)
Then we get True for each row if all the columns in each row are equivalent. And really we want to reduce this further by checking equality along all the rows as well. So we can take this result, feed it into np.all() and check along the rows of the resulting array:
>>> np.all(np.all(imgs == img, axis=1), axis=1)
array([False, True], dtype=bool)
And this gives us a boolean of which image inside imgs is equal to img, and we can simply get the result with np.where(). But you don't actually need to call np.all() twice like this; instead you can provide it multiple axes in a tuple to just reduce along both the rows and columns in one step:
>>> np.all(imgs == img, axis=(1, 2))
array([False, True], dtype=bool)
And that's what the solution above does. Hope that clears it up!

When does advanced indexing on structured masked arrays *really* return a copy?

When I have a structured masked array with boolean indexing, under what conditions do I get a view and when do I get a copy? The documentation says that advanced indexing always returns a copy, but this is not true, since something like X[X>0]=42 is technically advanced indexing, but the assignment works. My situation is more complex:
I want to set the mask of a particular field based on a criterion from another field, so I need to get the field, apply the boolean indexing, and get the mask. There are 3! = 6 orders of doing so.
Preparation:
In [83]: M = ma.MaskedArray(random.random(400).view("f8,f8,f8,f8")).reshape(10, 10)
In [84]: crit = M[:, 4]["f2"] > 0.5
Field - index - mask (fails):
In [85]: M["f3"][crit, 3].mask = True
In [86]: print(M["f3"][crit, 3].mask)
[False False False False False]
Index - field - mask (fails):
In [87]: M[crit, 3]["f3"].mask = True
In [88]: print(M[crit, 3]["f3"].mask)
[False False False False False]
Index - mask - field (fails):
In [94]: M[crit, 3].mask["f3"] = True
In [95]: print(M[crit, 3].mask["f3"])
[False False False False False]
Mask - index - field (fails):
In [101]: M.mask[crit, 3]["f3"] = True
In [102]: print(M.mask[crit, 3]["f3"])
[False False False False False]
Field - mask - index (succeeds):
In [103]: M["f3"].mask[crit, 3] = True
In [104]: print(M["f3"].mask[crit, 3])
[ True True True True True]
# set back to False so I can try method #6
In [105]: M["f3"].mask[crit, 3] = False
In [106]: print(M["f3"].mask[crit, 3])
[False False False False False]
Mask - field - index (succeeds):
In [107]: M.mask["f3"][crit, 3] = True
In [108]: print(M.mask["f3"][crit, 3])
[ True True True True True]
So, it looks like indexing must come last.
The issue of __setitem__ v. __getitem__ is important, but with structured array and masking it's a little harder to sort out when a __getitem__ is first making a copy.
Regarding the structured arrays, it shouldn't matter whether the field index occurs first or the element. However some releases appear to have a bug in this regard. I'll try to find a recent SO question where this was a problem.
With a masked array, there's the question of how to correctly modify the mask. The .mask is a property that accesses the underlying ._mask array. But that is fetched with __getattr__. So the simple setitem v getitem distinction does not apply directly.
Lets skip the structured bit first
In [584]: M = np.ma.MaskedArray(np.arange(4))
In [585]: M
Out[585]:
masked_array(data = [0 1 2 3],
mask = False,
fill_value = 999999)
In [586]: M.mask
Out[586]: False
In [587]: M.mask[[1,2]]=True
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-587-9010ee8f165e> in <module>()
----> 1 M.mask[[1,2]]=True
TypeError: 'numpy.bool_' object does not support item assignment
Initially mask is a scalar boolean, not an array.
This works
In [588]: M.mask=np.zeros((4,),bool) # change mask to array
In [589]: M
Out[589]:
masked_array(data = [0 1 2 3],
mask = [False False False False],
fill_value = 999999)
In [590]: M.mask[[1,2]]=True
In [591]: M
Out[591]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
This does not
In [592]: M[[1,2]].mask=True
In [593]: M
Out[593]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
M[[1,2]] is evidently the copy, and the assignment is to its mask attribute, not M.mask.
....
A masked array has .__setmask__ method. You can study that in np.ma.core.py. And the mask property is defined with
mask = property(fget=_get_mask, fset=__setmask__, doc="Mask")
So M.mask=... does use this.
So it looks like the problem case is doing
M.__getitem__(index).__setmask__(values)
hence the copy. The M.mask[]=... is doing
M._mask.__setitem__(index, values)
since _getmask just does return self._mask.
M["f3"].mask[crit, 3] = True
works because M['f3'] is a view. (M[['f1','f3']] is ok for get, but doesn't work for setting).
M.mask["f3"] is also a view. I'm not entirely sure of the order the relevant get and sets. __setmask__ has code that deals specifically with compound dtype (structured).
=========================
Looking at a structured array, without the masking complication, the indexing order matters
In [607]: M1 = np.arange(16).view("i,i")
In [609]: M1[[3,4]]['f1']=[3,4] # no change
In [610]: M1[[3,4]]['f1']
Out[610]: array([7, 9], dtype=int32)
In [611]: M1['f1'][[3,4]]=[1,2] # change
In [612]: M1
Out[612]:
array([(0, 1), (2, 3), (4, 5), (6, 1), (8, 2), (10, 11), (12, 13), (14, 15)], dtype=[('f0', '<i4'), ('f1', '<i4')])
So we still have a __getitem__ followed by a __setitem__, and we have to pay attention as to whether the get returns a view or a copy.
This is because although advanced indexing returns a copy, assigning to advanced indexing still works. Only the method where advanced indexing is the last operation is assigning to advanced indexing (through __setitem__).