Numpy: boolean comparison on two vectors

I have two vectors (or two one-dimensional numpy arrays with the same number of elements), a and b, where I want to count the number of cases in which:
a < 0 and b > 0
But when I type the above (or something similar) into IPython I get:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
How am I supposed to do the above operation?
Thank you

I'm not certain that I understand what you're trying to do, but you might want ((a < 0) & (b > 0)).sum()
>>> a
array([-1,  0,  2,  0])
>>> b
array([4, 0, 5, 3])
>>> a < 0
array([ True, False, False, False], dtype=bool)
>>> b > 0
array([ True, False,  True,  True], dtype=bool)
>>> ((a < 0) & (b > 0)).sum()
1
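The ValueError comes from using Python's and, which tries to collapse each whole array into a single truth value; the element-wise & (or np.logical_and) is what you want here. As an aside, np.count_nonzero gives the same count:
>>> np.count_nonzero((a < 0) & (b > 0))
1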

Related

numpy array of array with custom filtering

I am trying to filter a numpy array of arrays with given conditions. For example, given
input = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
I want the rows where element [0] >= 4, element [1] >= 5, and element [2] >= 6:
expected result = np.array([[4,5,6],[4,5,6]])
What would be the best way to achieve this with performance in mind?
Extended question: how do I retrieve the corresponding index of each output element in the input array?
You can do:
a = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
a[(a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6)]
Here you create binary masks for the conditions on each elements in each row of the data, use the logical and to combine them, and finally use the resulting mask to get the matching data rows.
To find the indices of the data rows matching the conditions, you can use NumPy's where() function:
idx = np.where((a[:,0] >= 4) & (a[:,1] >= 5) & (a[:,2] >= 6))[0]
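For the example data above, rows 1 and 2 satisfy all three conditions, so this should give:
print(idx)     # array([1, 2])
print(a[idx])  # the two [4, 5, 6] rows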
As per your request, here is a numba version:
import numpy as np
import numba as nb
import sys
import timeit

target = np.random.randint(low=-100000, high=100000, size=(int(sys.argv[2]), 3), dtype=np.int64)
comp = np.array([4, 5, 6])

# Toggle parallel=False for the sequential timing.
@nb.njit(parallel=True)
def cmp(a, b):
    c = np.empty((a.shape[0],), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        c[i] = a[i][0] > b[0] and a[i][1] > b[1] and a[i][2] > b[2]
    return c

def cmp_normal(a, b):
    # return np.all(a > b, axis=1)
    return (a[:, 0] >= b[0]) & (a[:, 1] >= b[1]) & (a[:, 2] >= b[2])

print(timeit.timeit(lambda: eval(sys.argv[1])(target, comp), number=10))
The first time below is for sequential numba, the second for parallel numba. Parallel numba gives about a 5x speed-up over sequential:
(base) xxx@xxx:~$ python test.py cmp 1000000
6.40756068899982
(base) xxx@xxx:~$ python test.py cmp 1000000
1.3425709140001345
Now vanilla numpy:
(base) xxx@xxx:~$ python test.py cmp_normal 1000000
4.04174472700015
Parallel numba is the fastest here. But if you try to return a[c] instead of the boolean array, numba slows down, so it depends on what you write.
In [223]: arr = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
In [224]: arr
Out[224]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 4,  5,  6],
       [ 0,  9, 19]])
Since you are testing values, one for each column, you can do a simple numpy == test (the (3,)-shaped test value broadcasts against the (4,3) arr):
In [225]: arr == [4,5,6]
Out[225]:
array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True],
       [False, False, False]])
and where a whole row is true:
In [226]: (arr==[4,5,6]).all(axis=1)
Out[226]: array([False, True, True, False])
This can be applied as a boolean mask to select those rows from arr:
In [227]: arr[_]
Out[227]:
array([[4, 5, 6],
       [4, 5, 6]])
and the numeric indices:
In [228]: np.nonzero(__)
Out[228]: (array([1, 2]),)

Udacity Deep Learning: Assignment 1, Part 5

I'm working on the Udacity Deep Learning class and I'm working on the first assignment, problem 5 where you try to count the number of duplicates in, say, your test set and training set. (Or validation and training, etc.)
I've looked at other people's answers, but I'm not satisfied with them for various reasons. For example, I tried out someone's hash-based solution, but I felt the results it returned were not likely to be correct.
So the main idea is that you have an array of images that are formatted as arrays. I.e. you're trying to compare two 3-dimensional arrays on index 0. One array is the training dataset, which is 200000 rows with each row containing a 2-D array that is the values for the image. The other is the test set, which is 10000 rows with each row containing a 2-D array of an image. The goal is to find all rows in the test set that match (for now, an exact match is fine) a row in the training set. Since each 'row' is itself an image (which is a 2-D array), to make this work fast I must be able to do a comparison of both sets as an element-wise compare of each row.
I worked up my own fairly simple solution like this:
# Find duplicates
# Loop through validation/test set and find ones that are identical matrices to something in the training data
def find_duplicates(compare_set, compare_labels, training_set, training_labels):
    dup_count = 0
    duplicates = []
    for i in range(len(compare_set)):
        if i > 100: continue
        if i % 100 == 0:
            print("i: ", i)
        for j in range(len(training_set)):
            if compare_labels[i] == training_labels[j]:
                if np.array_equal(compare_set[i], training_set[j]):
                    duplicates.append((i,j))
                    dup_count += 1
    return dup_count, duplicates

#print(len(valid_dataset))
print(len(train_dataset))
valid_dup_count, duplicates = find_duplicates(valid_dataset, valid_labels, train_dataset, train_labels)
print(valid_dup_count)
print(duplicates)
#test_dups = find_duplicates(test_dataset, train_dataset)
#print(test_dups)
The reason it just "continues" after 100 is that even those first 100 rows take a very long time. If I were to compare all 10,000 rows of the validation set to the training set, it would take forever.
I like my solution in principle because it allows me to not only count the duplicates, but get a list back of which matches existed. (Something missing on every other solution I've looked at.) This allows me to manually test that I'm getting the right solution.
What I really need is a much faster (i.e. built into Numpy) solution to compare matrices of matrices like this. I've played with 'isin' and 'where' but haven't figured out how to use those to get the results I'm after. Can someone point me in the right direction for a faster solution?
You should be able to compare a single image from compare_set throughout all the images in training_set with a single line of code using np.all(). You can provide multiple axes as a tuple in the axis argument to check array equality over rows and columns, going through each of the images. Then np.where() can give you the indices you want.
For example:
n_train = 50
n_validation = 10
h, w = 28, 28
training_set = np.random.rand(n_train, h, w)
validation_set = np.random.rand(n_validation, h, w)
# create some duplicates
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]
duplicates = []
for i, img in enumerate(validation_set):
    training_dups = np.where(np.all(training_set == img, axis=(1, 2)))[0]
    for j in training_dups:
        duplicates.append((i, j))
print(duplicates)
[(2, 5), (2, 10), (8, 5), (8, 10)]
Many numpy functions, np.all() included, let you specify the axes to operate on. For example, let's say you had the two arrays
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[1, 2], [5, 6]])
>>> A
array([[1, 2],
       [3, 4]])
>>> B
array([[1, 2],
       [5, 6]])
Now, A and B have the same first row, but a different second row. If we check equality for them
>>> A == B
array([[ True,  True],
       [False, False]], dtype=bool)
We get an array the same shape as A and B. But what if I want the indices of the rows which are equal? Well in this case what we can do is say 'only return True if all the values in the row (i.e. the value in each column) are True'. So we can use np.all() after the equality check, and provide it the axis corresponding to the columns.
>>> np.all(A == B, axis=1)
array([ True, False], dtype=bool)
So this result is letting us know that the first row is equal in both arrays, and the second row is not all equal. We can then get the row indices with np.where()
>>> np.where(np.all(A == B, axis=1))
(array([0]),)
So here we see row 0, i.e. A[0] and B[0] are equal.
Now in the solution I proposed, you have a 3D array instead of these 2D arrays. We don't care if a single row is equal, we care if all the rows and columns are equal. So breaking it down as above, let's create two random 5x5 images. I'll grab one of those images and check for equality among the array of two images:
>>> imgs = np.random.rand(2, 5, 5)
>>> img = imgs[1]
>>> imgs == img
array([[[False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False],
        [False, False, False, False, False]],

       [[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True]]], dtype=bool)
So it is obvious that the second image matches, but I want to reduce all those True values to a single True value; I only want the index of images where every value is equal.
If we use axis=1
>>> np.all(imgs == img, axis=1)
array([[False, False, False, False, False],
       [ True,  True,  True,  True,  True]], dtype=bool)
Then we get True for each row if all the columns in each row are equivalent. And really we want to reduce this further by checking equality along all the rows as well. So we can take this result, feed it into np.all() and check along the rows of the resulting array:
>>> np.all(np.all(imgs == img, axis=1), axis=1)
array([False, True], dtype=bool)
And this gives us a boolean of which image inside imgs is equal to img, and we can simply get the result with np.where(). But you don't actually need to call np.all() twice like this; instead you can provide it multiple axes in a tuple to just reduce along both the rows and columns in one step:
>>> np.all(imgs == img, axis=(1, 2))
array([False, True], dtype=bool)
And that's what the solution above does. Hope that clears it up!
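If exact matching is enough and the datasets get large, a hash-style lookup avoids the broadcasted comparison against every training image entirely. Here is a minimal sketch (assuming both sets share the same dtype and image shape, since tobytes() compares raw memory):
from collections import defaultdict

# Map each training image's raw bytes to the row indices holding it.
byte_index = defaultdict(list)
for j, img in enumerate(training_set):
    byte_index[img.tobytes()].append(j)

# Look each validation image up in O(1) instead of scanning all rows.
duplicates = []
for i, img in enumerate(validation_set):
    for j in byte_index.get(img.tobytes(), []):
        duplicates.append((i, j))
print(duplicates)  # [(2, 5), (2, 10), (8, 5), (8, 10)] for the example above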

Numpy compare values inside to return greater index

I have a list of numpy arrays:
[array([-1.67397643, -2.77258872]), array([-1.67397643, -2.77258872]), array([-2.77258872, -1.67397643]), array([-2.77258872, -1.67397643])]
Which index position inside each numpy array wins? I.e. since -1.67397643 > -2.77258872, the result for the first array would be 0.
Final output of the numpy array would be [0, 0, 1, 1] (a list is fine too)
How can I do that?
It seems you have a list of arrays, so I would start by making them a proper numpy array:
a = [array([-1.67397643, -2.77258872]), array([-1.67397643, -2.77258872]), array([-2.77258872, -1.67397643]), array([-2.77258872, -1.67397643])]
b = np.array(a).T # .T transposes it.
c = b[0] < b[1]
c is now an array([False, False, True, True], dtype=bool), and probably serves your purpose. If you must have [0,0,1,1] instead, then:
d = np.zeros(len(c))
d[c] = 1
d is now an array([ 0., 0., 1., 1.])
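Alternatively, since the question asks directly for the index of the larger value in each array, np.argmax along axis=1 gives the [0, 0, 1, 1] output in one step (a minor variant of the same idea):
e = np.argmax(np.array(a), axis=1)
# e is array([0, 0, 1, 1])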

TensorFlow Selecting entries (from one of two tensors) based on a boolean mask

I have three tensors, a, b, and mask, all of the same shape. I'd like to produce a new tensor c, such that each entry of c is taken from the corresponding entry of a iff the corresponding entry of mask is True; else, it is taken from the corresponding entry of b.
Example:
a = [0, 1, 2]
b = [10, 20, 30]
mask = [True, False, True]
c = [0, 20, 2]
How can I do this?
Why not use tf.select(condition, t, e, name=None)?
For your example:
c = tf.select(mask, a, b)
For more details about tf.select, see the TensorFlow Control Flow documentation.
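Note that tf.select was removed in later TensorFlow releases; its three-argument replacement is tf.where, so on a modern version the equivalent would be:
c = tf.where(mask, a, b)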
You can do it like this:
1) Convert mask to ints (0 for False, 1 for True).
2) Do element-wise multiplication of int_mask with tensor 'a' (elements that should not be included become 0).
3) Apply logical_not to mask.
4) Convert the negated mask to ints (again 0 for False, 1 for True).
5) Do element-wise multiplication of the negated int mask with tensor 'b' (elements that should not be included become 0).
6) Add the two masked tensors together and there you have it.
In code it should look something like this:
# tensor 'a' is [0, 1, 2]
# tensor 'b' is [10, 20, 30]
# tensor 'mask' is [True, False, True]
int_mask = tf.cast(mask, tf.int32)
# Leave only important elements in 'a'
a = tf.mul(a, int_mask)
mask = tf.logical_not(mask)
int_mask = tf.cast(mask, tf.int32)
b = tf.mul(b, int_mask)
result = tf.add(a, b)
Or simply use the tf.select() function, as someone already mentioned.
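For comparison, plain NumPy has the same select-by-mask operation built in as np.where, which can be handy for checking expected results outside the graph:
import numpy as np
a = np.array([0, 1, 2])
b = np.array([10, 20, 30])
mask = np.array([True, False, True])
print(np.where(mask, a, b))  # [ 0 20  2]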

Recode a pandas.Series containing 0, 1, and NaN to False, True, and NaN

Suppose I have a Series with NaNs:
pd.Series([0, 1, None, 1])
I want to transform this to be equal to:
pd.Series([False, True, None, True])
You'd think x == 1 would suffice, but instead, this returns:
pd.Series([False, True, False, True])
where the null value has become False. This is because np.nan == 1 returns False, rather than None or np.nan as in R.
Is there a nice, vectorized way to get what I want?
Maybe map can do it:
import pandas as pd
x = pd.Series([0, 1, None, 1])
print(x.map({1: True, 0: False}))
0    False
1     True
2      NaN
3     True
dtype: object
You can use where:
In [11]: (s == 1).where(s.notnull(), np.nan)
Out[11]:
0     0
1     1
2   NaN
3     1
dtype: float64
Note: the True and False have been cast to float as 0 and 1.
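In pandas 1.0+, the nullable boolean dtype avoids the float cast and keeps the missing value as a proper NA; this sketch assumes the series contains only 0, 1, and NaN:
s = pd.Series([0, 1, None, 1])
s.astype('boolean')
# 0    False
# 1     True
# 2     <NA>
# 3     True
# dtype: boolean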