How can I speed up this function in Python?

I am trying to figure out a way to speed up this function. I am trying to do all pairwise comparisons between the rows and columns of a dataframe (pairwise_df) and store the result. The comparison requires two numpy arrays of continuous values taken from another dataframe (df).
pairwise_df = pd.DataFrame(index=['insert1', 'insert2', 'insert3'], columns=['insert1', 'insert2', 'insert3'])
df = pd.DataFrame(data=[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
                        [2, 3, 4, 5, 7, 9, 10, 1, 2, 3]],
                  index=['insert1', 'insert2', 'insert3'], columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for row in list(pairwise_df.index.values):
    for col in list(pairwise_df):
        pairwise_df.at[row, col] = cosine_sim(np.array(df.loc[row]), np.array(df.loc[col]))
This works, but it takes about 18 minutes to run on a 2000 x 2000 dataframe. I'm sure there are ways to speed this up, but my programming experience is minimal.
The cosine_sim function is below, but the function used will vary, so it doesn't matter too much:
def cosine_sim(x, y):
    dot = np.dot(x, y)
    norma = np.linalg.norm(x)
    normb = np.linalg.norm(y)
    cos = dot / (norma * normb)
    return cos
Thanks!

You can avoid the loops by creating the arrays of all row combinations with np.tile and np.reshape. The trick here is to use np.einsum for the row-wise dot products.
m = df.values
x = np.tile(m, m.shape[0]).reshape(-1, m.shape[1])
y = np.tile(m.T, m.shape[0]).T
c = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
>>> c.reshape(-1, m.shape[0])
array([[1.        , 0.57142857, 0.75283826],
       [0.57142857, 1.        , 0.74102903],
       [0.75283826, 0.74102903, 1.        ]])
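As a side note (my own sketch, not part of the answer above): the tiled intermediates grow quickly for a 2000 x 2000 result, and the same cosine matrix can also be obtained by normalizing the rows of df once and taking a single matrix product:
m = df.values.astype(float)
norms = np.linalg.norm(m, axis=1, keepdims=True)
m_unit = m / norms                          # each row scaled to unit length
sims = m_unit @ m_unit.T                    # full pairwise cosine-similarity matrix
pairwise = pd.DataFrame(sims, index=df.index, columns=df.index)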

How can I efficiently mask out certain pairs in (2, N) tensor?

I have a torch tensor edge_index of shape (2, N) that represents edges in a graph. For each (x, y) there is also a (y, x), where x and y are node IDs (ints). During the forward pass of my model I need to mask out certain edges. So, for example, I have:
n1 = [0, 3, 4] # list of node ids as x
n2 = [1, 2, 1] # list of node ids as y
edge_index = [[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4],  # actual edges as (x, y) and (y, x)
              [2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]]
# do something that efficiently removes (x, y) and (y, x) edges as formed by n1 and n2
Final edge_index should look like:
>>> edge_index
[[1, 2, 3, 4, 2, 4],
[2, 1, 4, 3, 4, 2]]
Preferably, I'd like to efficiently build some kind of boolean mask that I can apply to edge_index, e.g. as edge_index[:, mask] or something like that.
This could also be done in numpy, but I'd like to avoid converting back and forth.
Edit #1:
If that can't be done, I can arrange things so that, instead of n1 and n2, I have the indices of the positions I need to exclude in one tensor, e.g. _except = [2, 3, 6, 7, 8, 9] (by building a dict/index once at the beginning).
Is there a way to get the desired result by "telling" edge_index to drop the indices in _except? edge_index[:, _except] gives me the ones I want to get rid of; I need its complement operation.
Edit #2:
I managed to do it like this:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
for i in range(len(n1)):
    mask = mask & ~(torch.tensor([n1[i], n2[i]], dtype=torch.long) == edge_index.T).all(dim=1) \
                & ~(torch.tensor([n2[i], n1[i]], dtype=torch.long) == edge_index.T).all(dim=1)
edge_index[:, mask]
but it is too slow and I can't use it. How can I speed it up?
Edit #3: I managed to solve the Edit #1 version efficiently with:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
mask[_except] = False
edge_index[:, mask]
Still interested in solving the original problem if someone comes up with something...
If you're OK with the approach you suggested in Edit #1,
you can get the complement result with:
edge_index[:, [i for i in range(edge_index.shape[1]) if not (i in _except)]]
Hope this is fast enough for your requirements.
Edit 1:
from functools import reduce
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
mask = ~reduce(lambda x, y: x | (reduce(lambda p, q: p & q, y)), res, reduce(lambda p, q: p & q, res[0]))
edge_index[:, mask]
Edit 2:
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1, 0]]], dim=0)  # append the reversed (y, x) pairs
# 6 = number of (x, y)/(y, x) pairs to remove, 12 = number of edges (hard-coded for this example)
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
mask = ~(res.sum(1) // 2).sum(0).bool()  # keep edges that do not fully match any excluded pair
edge_index[:, mask]
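For reference, a size-agnostic variant of the same broadcasting idea (my own sketch, not from the answer above; n1, n2 and edge_index are the tensors from the question):
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)])              # shape (2, k)
ids = torch.cat([ids, ids[[1, 0], :]], dim=1)                        # append reversed (y, x) pairs -> (2, 2k)
match = (edge_index.unsqueeze(2) == ids.unsqueeze(1)).all(dim=0)     # (N, 2k): True where both coordinates match
mask = ~match.any(dim=1)                                             # keep edges that match none of the pairs
edge_index[:, mask]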

numpy append in a for loop with different sizes

I have a for loop in which i changes by 2, and I want to save a value into a numpy array at an index that changes by 1 on each iteration.
n = 8 #steps
# random sequence
rand_seq = np.zeros(n-1)
for i in range(0, (n-1)*2, 2):
    curr_state = i + 3
I want to collect curr_state from each iteration into the rand_seq array outside the loop (seven values).
Can you help me with that?
Thanks a lot.
A much simpler version (if I understand the question correctly) would be:
np.arange(3, 15+1, 2)
where 3 = start, 15 = stop, 2 = step size.
In general, when using numpy, try to avoid appending elements in a for loop, as this is inefficient. I would suggest checking out the documentation of np.arange(), np.array() and np.zeros(), as in my experience these will solve 90% of array-creation issues.
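If the loop itself is needed (say, because curr_state comes from a more involved computation), a minimal sketch that fills the preallocated array by index could look like this; enumerate supplies the step-by-1 index while i steps by 2:
n = 8
rand_seq = np.zeros(n - 1)
for idx, i in enumerate(range(0, (n - 1) * 2, 2)):
    curr_state = i + 3
    rand_seq[idx] = curr_state
# rand_seq -> array([ 3.,  5.,  7.,  9., 11., 13., 15.])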
A straightforward list iteration:
In [313]: alist = []
     ...: for i in range(0,(8-1)*2,2):
     ...:     alist.append(i+3)
     ...:
In [314]: alist
Out[314]: [3, 5, 7, 9, 11, 13, 15]
or cast as a list comprehension:
In [315]: [i+3 for i in range(0,(8-1)*2,2)]
Out[315]: [3, 5, 7, 9, 11, 13, 15]
Or if you make an array with the same range parameters:
In [316]: arr = np.arange(0,(8-1)*2,2)
In [317]: arr
Out[317]: array([ 0, 2, 4, 6, 8, 10, 12])
you can add the 3 with one simple expression:
In [318]: arr + 3
Out[318]: array([ 3, 5, 7, 9, 11, 13, 15])
With lists, iteration and comprehensions are great. With numpy you should try to make an array, such as with arange, and modify that with whole-array methods (not with iterations).

Comparing a NumPy array to another

I have 2 NumPy arrays, such as:
correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])
I would like to create 2 new arrays, which contain the indices of 1's incorrectly predicted as something else and the indices of 2's incorrectly predicted as something else, respectively. Desired result:
incorrect_ones = [2]
incorrect_twos = [5, 8]
There just has to be some NumPy way to achieve this... Any ideas?
Thanks.
Calculate the boolean conditions and find the indices of the True values:
np.where((correct == 1) & (predicted == 2))[0]
# array([2])
np.where((correct == 2) & (predicted == 1))[0]
# array([5, 8])
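Equivalently (my own note, not part of the answer above), np.flatnonzero returns the same indices directly:
incorrect_ones = np.flatnonzero((correct == 1) & (predicted == 2))   # array([2])
incorrect_twos = np.flatnonzero((correct == 2) & (predicted == 1))   # array([5, 8])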

Find minimum absolute difference of elements in numpy array

I have an array a of shape (n, m), as well as an array b of shape (m,). I want to create an array c in which each entry is the distance from the corresponding element of b to the closest element in that column of a. I can do it with this code:
a = [[11, 2, 3, 4, 5], [4, 4, 6, 1, -2]]
b = [1, 3, 12, 0, 0]
c = []
for inner in range(len(a[0])):
    min_distance = float('inf')
    for outer in range(len(a)):
        current_distance = abs(b[inner] - a[outer][inner])
        if min_distance > current_distance:
            min_distance = current_distance
    c.append(min_distance)
# c=[3, 1, 6, 1, 2]
Elementwise iteration is very slow. What is the numpy way to do this?
If I understand your goal correctly, I think that this would do:
>>> c = np.min(np.abs(np.array(a) - b), axis = 0)
>>> c
array([3, 1, 6, 1, 2])
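For what it's worth (my own note, not part of the answer): the subtraction relies on broadcasting, with b of shape (5,) stretched against np.array(a) of shape (2, 5) along the last axis before the minimum is taken over axis 0:
diff = np.abs(np.array(a) - np.array(b))   # b broadcasts to shape (2, 5)
diff.shape                                 # (2, 5)
diff.min(axis=0)                           # array([3, 1, 6, 1, 2])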

Calculating the sum of vectors

I have two matrices a and b and would like to calculate all pairwise sums of their rows into a 3-D array. How can I do this more efficiently than with the following code:
a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[4,5],[6,7]])
n1 = a.shape[0]
n2 = b.shape[0]
f = a.shape[1]
c = np.zeros((n1, n2, f))
for i in range(n1):
    for j in range(n2):
        c[i,j,:] = a[i,:] + b[j,:]
np.einsum and the like obviously don't work here, and neither does an outer product. Is there an appropriate method?
You can transform your loop expression into a broadcasting one:
c[i,j,:] = a[i,:] + b[j,:]
c[i,j,:] = a[i,None,:] + b[None,j,:] # fill in the missing dimensions
c = a[:,None,:] + b[None,:,:]
In [167]: a[:,None,:]+b[None,:,:]
Out[167]:
array([[[ 5,  7],
        [ 7,  9]],

       [[ 7,  9],
        [ 9, 11]],

       [[ 9, 11],
        [11, 13]]])
In [168]: _.shape
Out[168]: (3, 2, 2)
a[:,None]+b does the same thing, since a leading None (np.newaxis) is added automatically by broadcasting, and a trailing : is implied.
Use broadcasting and add extra dimensions by inserting new axes with None (np.newaxis):
a[:,None,:]+b[None,:,:]
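A quick sanity check against the original loop (my own addition, using the a and b arrays from the question):
c_loop = np.zeros((a.shape[0], b.shape[0], a.shape[1]))
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        c_loop[i, j, :] = a[i, :] + b[j, :]
c_bcast = a[:, None, :] + b[None, :, :]
np.array_equal(c_loop, c_bcast)   # True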