How can I efficiently mask out certain pairs in (2, N) tensor? - numpy

I have a torch tensor edge_index of shape (2, N) that represents edges in a graph. For each (x, y) there is also a (y, x), where x and y are node IDs (ints). During the forward pass of my model I need to mask out certain edges. So, for example, I have:
n1 = [0, 3, 4] # list of node ids as x
n2 = [1, 2, 1] # list of node ids as y
edge_index = [[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4], # actual edges as (x, y) and (y, x)
[2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]]
# do something that efficiently removes (x, y) and (y, x) edges as formed by n1 and n2
Final edge_index should look like:
>>> edge_index
[[1, 2, 3, 4, 2, 4],
[2, 1, 4, 3, 4, 2]]
Preferably we need to efficiently make some kind of boolean mask that I can apply to edge index e.g. as edge_index[:, mask] or something like that.
Could also be done in numpy but I'd like to avoid converting back and forth.
Edit #1:
If that can't be done, then I can think of a way so that, instead of n1 and n2, I have access to the indices of the positions I need to exclude in one tensor e.g. _except=[2, 3, 6, 7, 8, 9] (by making a dict/index once in the beginning).
Is there a way to get the desired result by "telling" edge_index to drop the indices in except? edge_index[:, _except] gives me the ones I want to get rid of. I need its complement operation.
Edit #2:
I managed to do it like this:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
for i in range(len(n1)):
mask = mask & ~(torch.tensor([n1[i], n2[i]], dtype=torch.long) == edge_index.T).all(dim=1) & ~(torch.tensor([n2[i], n1[i]], dtype=torch.long) == edge_index.T).all(dim=1)
edge_index[:, mask]
but it is too slow and I can't use it. How can I speed it up?
Edit #3: I managed to solve this Edit#1 efficiently with:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
mask[_except] = False
edge_index[:, mask]
Still interested in solving the original problem if someone comes up with something...

If you're ok with the way you suggested at Edit#1,
you get the complement result by:
edge_index[:, [i for i in range(edge_index.shape[1]) if not (i in _except)]]
hope this is fast enough for your requirement.
Edit 1:
from functools import reduce
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
mask = ~reduce(lambda x, y: x | (reduce(lambda p, q: p & q, y)), res, reduce(lambda p, q: p & q, res[0]))
edge_index[:, mask]
Edit 2:
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
mask = ~(res.sum(1) // 2).sum(0).bool()
edge_index[:, mask]

Related

How can I speed up this function in Python?

I am trying to figure out a way to speed up this function. I am trying to do all pairwise comparisons between the rows and columns of a dataframe (pairwise_df) and store the result. The comparison requires two numpy arrays of continuous values taken from another dataframe (df).
pairwise_df = pd.DataFrame(index = ['insert1', 'insert2', 'insert3'], columns = ['insert1', 'insert2', 'insert3'])
df = pd.DataFrame(data = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[2, 3, 4, 5, 7, 9, 10, 1, 2, 3]], index = ['insert1', 'insert2', 'insert3'], columns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for row in list(pairwise_df.index.values):
for col in list(pairwise_df):
pairwise_df.at[row, col] = cosine_sim(np.array(df.loc[row]), np.array(df.loc[col]))
This works, but takes about 18mins to run on a 2000 x 2000 dataframe, and i'm sure there are ways to speed this up, but my programming experience is minimal.
The cosine_sim function is here, but the function used will vary so it doesn't matter too much:
def cosine_sim(x, y):
dot = np.dot(x, y)
norma = np.linalg.norm(x)
normb = np.linalg.norm(y)
cos = dot / (norma * normb)
return cos
Thanks!
You can avoid loops to compute cosine similarity by creating the array of all combinations using np.tile and np.reshape. The trick here is to use np.einsum to replace the dot product.
m = df.values
x = np.tile(m, m.shape[0]).reshape(-1, m.shape[1])
y = np.tile(m.T, m.shape[0]).T
c = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
>>> c.reshape(-1, m.shape[0])
array([[1. , 0.57142857, 0.75283826],
[0.57142857, 1. , 0.74102903],
[0.75283826, 0.74102903, 1. ]])

Comparing a NumPy array to another

I have 2 NumPy arrays, such as:
correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])
I would like to create 2 new arrays, which contain the indices of 1's incorrectly predicted as something else and the indices of 2's incorrectly predicted as something else, respectively. Desired result:
incorrect_ones = [2]
incorrect_twos = [5, 8]
There just has to be some NumPy way to achieve this... Any ideas?
Thanks.
Calculate the boolean conditions and find the indexes of the locations of the True values:
np.where((correct == 1) & (predicted == 2))[0]
# array([2])
np.where((correct == 2) & (predicted == 1))[0]
# array([5, 8])

Find minimum absolute difference of elements in numpy array

I have an array of arrays of shape (n, m), as well as an array b of shape (m). I want to create an array c containing distances to the closest element. I can do it with this code:
a = [[11, 2, 3, 4, 5], [4, 4, 6, 1, -2]]
b = [1, 3, 12, 0, 0]
c = []
for inner in range(len(a[0])):
min_distance = float('inf')
for outer in range(len(a)):
current_distance = abs(b[inner] - a[outer][inner])
if min_distance > current_distance:
min_distance = current_distance
c.append(min_distance)
# c=[3, 1, 6, 1, 2]
Elementwise iteration is very slow. What is the numpy way to do this?
If I understand your goal correctly, I think that this would do:
>>> c = np.min(np.abs(np.array(a) - b), axis = 0)
>>> c
array([3, 1, 6, 1, 2])

Best way to get joint probability matrix from categorical data

My goal is to get joint probability (here we use count for example) matrix from data samples. Now I can get the expected result, but I'm wondering how to optimize it. Here is my implementation:
def Fill2DCountTable(arraysList):
'''
:param arraysList: List of arrays, length=2
each array is of shape (k, sampleSize),
k == 1 (or None. numpy will align it) if it's single variable
else k for a set of variables of size k
:return: xyJointCounts, xMarginalCounts, yMarginalCounts
'''
jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
_, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True) ###HIGHLIGHT###
_, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
xyJointCounts = np.zeros((xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
xyJointCounts[tuple(np.vstack([xReverseIndexs, yReverseIndexs]))] = jointCounts
xMarginalCounts = np.sum(xyJointCounts, axis=1) ###HIGHLIGHT###
yMarginalCounts = np.sum(xyJointCounts, axis=0)
return xyJointCounts, xMarginalCounts, yMarginalCounts
def Fill3DCountTable(arraysList):
# :param arraysList: List of arrays, length=3
jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
_, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True)
_, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
_, SReverseIndexs = np.unique(jointUniques[2:], axis=1, return_inverse=True)
SxyJointCounts = np.zeros((SReverseIndexs.max() + 1, xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
SxyJointCounts[tuple(np.vstack([SReverseIndexs, xReverseIndexs, yReverseIndexs]))] = jointCounts
SMarginalCounts = np.sum(SxyJointCounts, axis=(1, 2))
SxJointCounts = np.sum(SxyJointCounts, axis=2)
SyJointCounts = np.sum(SxyJointCounts, axis=1)
return SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts
My use scenario is to do conditional independence test over variables. SampleSize is usually quite big (~10k) and each variable's categorical cardinality is relatively small (~10). I still find the speed not satisfying.
How to best optimize this code, or even logic outside the code? I may have some thoughts:
The ###HIGHLIGHT### lines. On a single X I may calculate (X;Y1), (Y2;X), (X;Y3|S1)... for many times, so what if I save cache variable's (and conditional set's) {uniqueValue: reversedIndex} dictionary and its marginal count, and then directly get marginalCounts (no need to sum) and replace to get reverseIndexs (no need to unique).
How to further use matrix parallelization to do CITest in batch, i.e. calculate (X;Y|S1), (X;Y|S2), (X;Y|S3)... simultaneously?
Will torch be faster than numpy, on same CPU? Or on GPU?
It's an open question. Thank you for any possible ideas. Big thanks for your help :)
================== A test example is as follows ==================
xs = np.array( [2, 4, 2, 3, 3, 1, 3, 1, 2, 1] )
ys = np.array( [5, 5, 5, 4, 4, 4, 4, 4, 6, 5] )
Ss = np.array([ [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 0, 1, 0, 1, 0, 1, 0] ])
xyJointCounts, xMarginalCounts, yMarginalCounts = Fill2DCountTable([xs, ys])
SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts = Fill3DCountTable([xs, ys, Ss])
get 2D from (X;Y): xMarginalCounts=[3 3 3 1], yMarginalCounts=[5 4 1], and xyJointCounts (added axes name FYI):
xy| 4 5 6
--|-------
1 | 2 1 1
2 | 0 2 1
3 | 3 0 0
4 | 0 1 0
get 3D from (X;Y|{Z1,Z2}): SxyJointCounts is of shape 4x4x3, where the first 4 means the cardinality of {Z1,Z2} (00, 01, 10, 11 with respective SMarginalCounts=[3 3 1 3]). SxJointCounts is of shape 4x4 and SyJointCounts is of shape 4x3.

How to sort a multi-dimensional tensor using the returned indices of tf.nn.top_k?

I have two multi-dimensional tensors a and b. And I want to sort them by the values of a.
I found tf.nn.top_k is able to sort a tensor and return the indices which is used to sort the input. How can I use the returned indices from tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid can be one solution. For example, the following code is tested on python 3.5 with tensorflow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a solution which is more readable and doesn't need to create auxiliary_indices above.
Your code have a problem.
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
Because TensorFlow Cannot reshape a tensor with 60 elements to shape [2,5,3,7] (210 elements).
And you can't sort a rank 4 tensor (b) using indices of rank 3 tensors.