Comparing a NumPy array to another


I have 2 NumPy arrays, such as:
correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])
I would like to create 2 new arrays, which contain the indices of 1's incorrectly predicted as something else and the indices of 2's incorrectly predicted as something else, respectively. Desired result:
incorrect_ones = [2]
incorrect_twos = [5, 8]
There just has to be some NumPy way to achieve this... Any ideas?
Thanks.

Calculate the boolean conditions and find the indexes of the locations of the True values:
np.where((correct == 1) & (predicted == 2))[0]
# array([2])
np.where((correct == 2) & (predicted == 1))[0]
# array([5, 8])
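The question asks for labels predicted as "something else", so with more than two classes it may be safer to test `predicted != label` rather than naming the wrong class. A minimal sketch, assuming a hypothetical helper called `misclassified` (not part of NumPy):

```python
import numpy as np

correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])

def misclassified(correct, predicted, label):
    """Indices where the true class is `label` but the prediction differs.

    Hypothetical helper for illustration only.
    """
    # flatnonzero returns the positions of the True entries of a 1-D mask
    return np.flatnonzero((correct == label) & (predicted != label))

incorrect_ones = misclassified(correct, predicted, 1)
incorrect_twos = misclassified(correct, predicted, 2)
```

For the two-class data above this should give the same indices as the `np.where` calls.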

Related

How can I efficiently mask out certain pairs in (2, N) tensor?

I have a torch tensor edge_index of shape (2, N) that represents edges in a graph. For each (x, y) there is also a (y, x), where x and y are node IDs (ints). During the forward pass of my model I need to mask out certain edges. So, for example, I have:
n1 = [0, 3, 4] # list of node ids as x
n2 = [1, 2, 1] # list of node ids as y
edge_index = [[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4], # actual edges as (x, y) and (y, x)
[2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]]
# do something that efficiently removes (x, y) and (y, x) edges as formed by n1 and n2
Final edge_index should look like:
>>> edge_index
[[1, 2, 3, 4, 2, 4],
[2, 1, 4, 3, 4, 2]]
Preferably, I'd like to efficiently build some kind of boolean mask that I can apply to edge_index, e.g. as edge_index[:, mask] or something similar.
Could also be done in numpy but I'd like to avoid converting back and forth.
Edit #1:
If that can't be done, an alternative is to build a dict/index once at the beginning so that, instead of n1 and n2, I have the positions to exclude in a single tensor, e.g. _except=[2, 3, 6, 7, 8, 9].
Is there a way to get the desired result by "telling" edge_index to drop the indices in _except? edge_index[:, _except] gives me the columns I want to get rid of; I need the complement operation.
Edit #2:
I managed to do it like this:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
for i in range(len(n1)):
    mask = mask & ~(torch.tensor([n1[i], n2[i]], dtype=torch.long) == edge_index.T).all(dim=1) & ~(torch.tensor([n2[i], n1[i]], dtype=torch.long) == edge_index.T).all(dim=1)
edge_index[:, mask]
but it is too slow and I can't use it. How can I speed it up?
Edit #3: I managed to solve the Edit #1 variant efficiently with:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
mask[_except] = False
edge_index[:, mask]
Still interested in solving the original problem if someone comes up with something...

If you're OK with the approach you suggested in Edit #1, you can get the complement with:
edge_index[:, [i for i in range(edge_index.shape[1]) if i not in _except]]
I hope this is fast enough for your requirements.
Edit 1:
from functools import reduce
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
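# ids.shape[0] is the number of (x, y) and (y, x) pairs (6 here); edge_index.shape[1] is the number of edges (12 here)
res = edge_index.unsqueeze(0).repeat(ids.shape[0], 1, 1) == ids.unsqueeze(2).repeat(1, 1, edge_index.shape[1])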
mask = ~reduce(lambda x, y: x | (reduce(lambda p, q: p & q, y)), res, reduce(lambda p, q: p & q, res[0]))
edge_index[:, mask]
Edit 2:
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
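# ids.shape[0] is the number of (x, y) and (y, x) pairs (6 here); edge_index.shape[1] is the number of edges (12 here)
res = edge_index.unsqueeze(0).repeat(ids.shape[0], 1, 1) == ids.unsqueeze(2).repeat(1, 1, edge_index.shape[1])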
mask = ~(res.sum(1) // 2).sum(0).bool()
edge_index[:, mask]
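Another way to build the mask without a Python loop or a large repeated tensor is to encode each (x, y) pair as one integer and test membership. The sketch below uses NumPy, since the question allows it; it assumes node IDs are non-negative integers, and `num_nodes`, `edge_keys` and `drop_keys` are names introduced here for illustration. The same idea should carry over to torch (for example with `torch.isin` in recent PyTorch versions), but that variant is not shown or tested here.

```python
import numpy as np

n1 = np.array([0, 3, 4])  # node ids as x
n2 = np.array([1, 2, 1])  # node ids as y
edge_index = np.array([[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4],
                       [2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]])

# Assumption: node ids are non-negative, so x * num_nodes + y is unique per pair
num_nodes = int(max(edge_index.max(), n1.max(), n2.max())) + 1

# One integer key per edge, and one per pair to drop (both directions)
edge_keys = edge_index[0] * num_nodes + edge_index[1]
drop_keys = np.concatenate([n1 * num_nodes + n2, n2 * num_nodes + n1])

# True for edges to keep
mask = ~np.isin(edge_keys, drop_keys)
filtered = edge_index[:, mask]
```

For very large node counts the product `x * num_nodes + y` could overflow the integer dtype, so a 64-bit dtype is the safer choice.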

How can I speed up this function in Python?

I am trying to figure out a way to speed up this function. I am trying to do all pairwise comparisons between the rows and columns of a dataframe (pairwise_df) and store the result. The comparison requires two numpy arrays of continuous values taken from another dataframe (df).
pairwise_df = pd.DataFrame(index = ['insert1', 'insert2', 'insert3'], columns = ['insert1', 'insert2', 'insert3'])
df = pd.DataFrame(data = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[2, 3, 4, 5, 7, 9, 10, 1, 2, 3]], index = ['insert1', 'insert2', 'insert3'], columns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for row in list(pairwise_df.index.values):
    for col in list(pairwise_df):
        pairwise_df.at[row, col] = cosine_sim(np.array(df.loc[row]), np.array(df.loc[col]))
This works, but it takes about 18 minutes to run on a 2000 x 2000 dataframe. I'm sure there are ways to speed it up, but my programming experience is minimal.
The cosine_sim function is here, but the function used will vary so it doesn't matter too much:
def cosine_sim(x, y):
    dot = np.dot(x, y)
    norma = np.linalg.norm(x)
    normb = np.linalg.norm(y)
    cos = dot / (norma * normb)
    return cos
Thanks!

You can avoid loops to compute cosine similarity by creating the array of all combinations using np.tile and np.reshape. The trick here is to use np.einsum to replace the dot product.
m = df.values
x = np.tile(m, m.shape[0]).reshape(-1, m.shape[1])
y = np.tile(m.T, m.shape[0]).T
c = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
>>> c.reshape(-1, m.shape[0])
array([[1. , 0.57142857, 0.75283826],
[0.57142857, 1. , 0.74102903],
[0.75283826, 0.74102903, 1. ]])
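For cosine similarity specifically, the tiled arrays can probably be avoided: scaling each row to unit length and taking one matrix product yields all pairwise values at once. This is a sketch under the assumption that no row is all zeros (that would divide by zero); `unit` and `sim` are names introduced here, and the approach does not generalize to arbitrary comparison functions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
          [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
          [2, 3, 4, 5, 7, 9, 10, 1, 2, 3]],
    index=['insert1', 'insert2', 'insert3'],
    columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)

m = df.to_numpy(dtype=float)

# Scale every row to unit Euclidean length (assumes no all-zero rows)
unit = m / np.linalg.norm(m, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities
sim = unit @ unit.T

pairwise_df = pd.DataFrame(sim, index=df.index, columns=df.index)
```

This keeps memory at one (rows x rows) result instead of two (rows² x columns) intermediates.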

Find minimum absolute difference of elements in numpy array

I have an array a of shape (n, m), as well as an array b of shape (m,). I want to create an array c of shape (m,) where c[j] is the smallest absolute difference between b[j] and any element in column j of a. I can do it with this code:
a = [[11, 2, 3, 4, 5], [4, 4, 6, 1, -2]]
b = [1, 3, 12, 0, 0]
c = []
for inner in range(len(a[0])):
    min_distance = float('inf')
    for outer in range(len(a)):
        current_distance = abs(b[inner] - a[outer][inner])
        if min_distance > current_distance:
            min_distance = current_distance
    c.append(min_distance)
# c=[3, 1, 6, 1, 2]
Elementwise iteration is very slow. What is the numpy way to do this?

If I understand your goal correctly, I think that this would do:
>>> c = np.min(np.abs(np.array(a) - b), axis = 0)
>>> c
array([3, 1, 6, 1, 2])
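The one-liner relies on broadcasting: b has shape (m,), so it is subtracted from every row of the (n, m) array. A short sketch that spells out the intermediate step and also records which row was closest; `diff` and `closest_row` are names added here for illustration:

```python
import numpy as np

a = np.array([[11, 2, 3, 4, 5], [4, 4, 6, 1, -2]])  # shape (n, m)
b = np.array([1, 3, 12, 0, 0])                      # shape (m,)

# b is broadcast across the rows of a, giving an (n, m) array of distances
diff = np.abs(a - b)

# Smallest distance per column, and the row index where it occurs
# (argmin returns the first row on ties)
c = diff.min(axis=0)
closest_row = diff.argmin(axis=0)
```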

Pandas - Row mask and 2d ndarray assignment

I'm having some problems with pandas. I think I'm not using it properly, and I need some help to do it right.
I have a mask for the rows of a dataframe; the mask is a simple list of Boolean values.
I would like to assign a 2D array to a new or existing column, one inner array per masked row.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with ndarray.
Thanks for your help!
EDIT - In context:
To add some context: this is used to fill in folds step by step, so I compute and set values for one subset of the dataframe at a time.
Instead of this, as suggested by Parth:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed to this:
dataframe.loc[mask, out] = pd.Series([f for f in features], index=mask[mask==True].index)
Otherwise, all values that were already set are overwritten with NaN.
I had left that detail out.
Thanks!

Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give desired output:
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
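For the fold-by-fold case mentioned in the question's edit, where values set in an earlier pass must survive, one possible approach is to rebuild the column from a plain Python list. This is only a sketch: `assign_rows` is a hypothetical helper, not a pandas function, and the second call's data (`[[9, 9]]`) is made up for illustration.

```python
import numpy as np
import pandas as pd

def assign_rows(df, column, mask, rows):
    """Store one list per masked row in `column`, leaving other rows untouched.

    Hypothetical helper for illustration only.
    """
    # Start from the existing values, or from NaN if the column is new
    values = df[column].tolist() if column in df.columns else [np.nan] * len(df)
    positions = np.flatnonzero(np.asarray(mask))
    if len(positions) != len(rows):
        raise ValueError("mask selects a different number of rows than provided")
    for pos, row in zip(positions, rows):
        values[pos] = list(row)
    # dtype=object keeps each list as a single cell value
    df[column] = pd.Series(values, index=df.index, dtype=object)

dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})

# First pass: fill the rows where Fun == 'a'
assign_rows(dataframe, 'NewData', dataframe['Fun'] == 'a',
            [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]])
first_pass = dataframe['NewData'].tolist()

# Second pass (made-up data): fill the remaining row without losing the first pass
assign_rows(dataframe, 'NewData', dataframe['Fun'] == 'b', [[9, 9]])
second_pass = dataframe['NewData'].tolist()
```

The explicit loop runs only over the masked rows, so the cost is mostly in rebuilding the column on each call.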

How to sort a multi-dimensional tensor using the returned indices of tf.nn.top_k?

I have two multi-dimensional tensors, a and b, and I want to sort them by the values of a.
I found that tf.nn.top_k can sort a tensor and return the indices used to sort the input. How can I use the indices returned by tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid can be one solution. For example, the following code is tested on python 3.5 with tensorflow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a solution which is more readable and doesn't need to create auxiliary_indices above.

Your code has a problem:
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
TensorFlow cannot reshape a tensor with 60 elements to shape [2, 5, 3, 7], which needs 210 elements.
Also, you can't sort a rank-4 tensor (b) using indices taken from a rank-3 tensor.
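For what it's worth, the indexing described in the question (sorted_b[i, j, k, :] = b[i, j, indices[i, j, k], :]) can at least be written down directly in NumPy with `np.take_along_axis`. The sketch below is a NumPy illustration of those semantics, not TensorFlow code; the top-k step is emulated with `np.argsort`. In TensorFlow, `tf.gather` with its `batch_dims` argument may offer a similar shortcut, but that is an untested assumption here.

```python
import numpy as np

a = np.arange(30).reshape(2, 5, 3)
b = np.arange(210).reshape(2, 5, 3, 7)
k = 2

# Emulate top_k: indices of the k largest entries along the last axis of a,
# largest first
indices = np.argsort(-a, axis=-1)[..., :k]           # shape (2, 5, k)
sorted_a = np.take_along_axis(a, indices, axis=-1)   # shape (2, 5, k)

# Add a trailing axis so the indices broadcast over b's last dimension
sorted_b = np.take_along_axis(b, indices[..., None], axis=2)  # shape (2, 5, k, 7)
```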
