row-wise calculation of cosine similarity in pandas without looping - pandas

I have a pandas dataframe df with many rows. For each row, I want to calculate the cosine similarity between the row's columns A (first vector) and the row's columns B (second vector). In the end, I aim to get a vector with one cosine similarity value per row. I have found a solution, but it seems to me like it could be done much faster without this loop. Can anyone give me some feedback on this code?
Thank you very much!
for row in np.unique(df.index):
    cos_sim[row] = scipy.spatial.distance.cosine(df[df.index == row][columnsA],
                                                 df[df.index == row][columnsB])
df['cos_sim'] = cos_sim
Here comes some sample data:
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA=['featureA1', 'featureA2']
columnsB=['featureB1', 'featureB2']
This is my desired output (cosine similarity for Pit, Mat, Tim and Sam):
cos_sim=[1, 1, 1, 1]
I am already getting this output with my method, but I am sure the code could be improved from a performance perspective.

Several things you can improve on :)
Take a look at the DataFrame.apply function. pandas already offers you looping "under the hood".
df['cos_sim'] = df.apply(lambda row: scipy.spatial.distance.cosine(row[columnsA], row[columnsB]), axis=1)
or something similar should be more performant. Note the axis=1, which makes apply operate row-wise, and the closing parenthesis the original snippet was missing.
Also take a look at DataFrame.loc
df[df.index==row][columnsA]
and
df.loc[row,columnsA]
should be equivalent
If you really have to iterate over the dataframe (this should be avoided, again due to performance penalties, and because it is harder to read and understand), pandas gives you a generator over the rows (and their index labels):
for index, row in df.iterrows():
    scipy.spatial.distance.cosine(row[columnsA], row[columnsB])
Finally, as mentioned above: to get better answers on Stack Overflow, always provide a concrete example where the problem is reproducible. Otherwise it is much harder to interpret the question correctly and to test a solution.
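Since apply still runs a Python-level function per row, a fully vectorized alternative avoids the loop entirely. Here is a sketch using plain NumPy on the question's sample data. Note that scipy.spatial.distance.cosine actually returns the cosine *distance* (1 − similarity), so the expression below computes the similarity directly, which is what the desired output [1, 1, 1, 1] describes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']
columnsB = ['featureB1', 'featureB2']

A = df[columnsA].to_numpy()
B = df[columnsB].to_numpy()

# Row-wise cosine similarity: dot product along the feature axis,
# divided by the product of the row norms -- no Python-level loop.
df['cos_sim'] = (A * B).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
print(np.round(df['cos_sim'].to_numpy(), 6))  # all rows are ~1 here
```

This runs as a handful of array operations regardless of the number of rows, so it scales far better than the per-row loop or apply.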

Pretty old post, but I am replying for future readers. I created https://github.com/ma7555/evalify for all those row-wise similarity/distance calculations (disclaimer: I am the owner of the package).

Related

Sampling non-repeating integers with given probability distribution

I need to sample n different values taken from a set of integers.
These integers should have different occurrence probabilities, e.g. the larger the integer, the likelier it is.
Using the random package I can sample a set of distinct values from the set by means of the method
random.sample
However, it doesn't seem to provide a way to associate a probability distribution.
On the other hand, there is the numpy package, which allows one to specify the distribution, but it returns a sample with repetitions. This can be done with the method
numpy.random.choice
I am looking for a method (or a way around) to do what the two methods do, but together.
You can actually use numpy.random.choice, as it has a replace parameter. If set to False, the sampling will be done without replacement.
Here's a random example:
>>> np.random.choice([1, 2, 4, 6, 9], 3, replace=False, p=[1/2, 1/8, 1/8, 1/8, 1/8])
array([1, 9, 4])
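For reproducibility, the same call also works with the newer Generator API (a minor variant of the above, assuming NumPy >= 1.17):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducible runs
# p must sum to 1; replace=False guarantees the sampled values are distinct
sample = rng.choice([1, 2, 4, 6, 9], size=3, replace=False,
                    p=[1/2, 1/8, 1/8, 1/8, 1/8])
print(sample)  # three distinct values drawn from the population
```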

How to calculate the time complexity for backtracking with pruning?

The question is something like this,
Given a list of edges, find a path from SRC to DEST that gives the highest points.
INPUT: [['A', 'B', 5] , ['A', 'C', 2], ['B', 'C', 5]], find path from A to B.
OUTPUT: ['A', 'C', 'B'], which gives 7 points (2 + 5).
I understand this is a graph problem, one way to solve it will be DFS with backtracking whereby we try out all the options that lead us from A to B, and record the one with the highest score.
The time complexity for this will probably be N! in a fully connected graph, since we are trying all permutations.
However, I think we can optimise it by pruning while backtracking, e.g. by keeping track of the highest points achievable at each node: if the current points are less than the existing best for that node, we don't have to continue down that branch.
But I can't quite figure out the time complexity for this with pruning. Will the worst case still be N! since we can technically still do all permutations?
This problem is very similar to the TSP (Traveling Salesman Problem).
One way to solve this problem is dynamic programming with bit-masking, which runs in O(2^N * N^2) which is better than O(N!).
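A sketch of that bitmask idea (my own illustration, not code from the answer): memoize on the pair (current node, set of visited nodes), here treating the edges as undirected and visiting each node at most once. There are O(2^N * N) states and each expands O(N) neighbours, which is where the O(2^N * N^2) bound comes from. Note that under a plain sum of edge weights, the question's best path A -> C -> B totals 2 + 5 = 7:

```python
from functools import lru_cache

def best_path_points(edges, src, dest):
    # Build an undirected adjacency map (assumption: edges go both ways
    # and a node may appear at most once on a path).
    nodes = sorted({u for e in edges for u in e[:2]})
    idx = {n: i for i, n in enumerate(nodes)}
    adj = {n: [] for n in nodes}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    @lru_cache(maxsize=None)
    def dfs(node, visited):
        # visited is a bitmask of the nodes already on the path
        if node == dest:
            return 0
        best = None
        for nxt, w in adj[node]:
            bit = 1 << idx[nxt]
            if visited & bit:
                continue
            sub = dfs(nxt, visited | bit)
            if sub is not None and (best is None or w + sub > best):
                best = w + sub
        return best  # None if dest is unreachable from this state

    return dfs(src, 1 << idx[src])

print(best_path_points([['A', 'B', 5], ['A', 'C', 2], ['B', 'C', 5]],
                       'A', 'B'))  # A -> C -> B scores 2 + 5 = 7
```

The memoization is what improves on plain backtracking: each (node, visited-set) state is solved once instead of once per permutation that reaches it.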

Higher (4+) dimension/axis in numpy...are they ever actually used in computation?

What I mean by the title is that I sometimes come across code that requires numpy operations (for example sum or average) along a specified axis. For example:
np.sum([[0, 1], [0, 5]], axis=1)
I can grasp this concept, but do we actually ever do these operations also along higher dimensions? Or is that not a thing? And if yes, how do you gain intuition for high-dimensional datasets and how do you make sure you are working along the right dimension/axis?
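They are used, and quite routinely; for instance, image batches in deep learning are usually 4-d. A small sketch of reducing a 4-d array along different axes (the axis names in the comments are just a common convention, not anything NumPy enforces):

```python
import numpy as np

# A 4-d array, e.g. (batch, channel, height, width)
a = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Summing over axis=1 collapses the "channel" axis
print(np.sum(a, axis=1).shape)   # (2, 4, 5)

# Averaging over the last two axes, e.g. a spatial mean per channel
print(a.mean(axis=(2, 3)).shape)  # (2, 3)
```

A useful intuition: the axes you name are the ones that disappear from the shape, and everything else is carried along unchanged.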

numpy: Broadcasting a vector horizontally

I have a 1-D array in numpy v. I'd like to copy it to make a matrix with each row being a copy of v. That's easy: np.broadcast_to(v, desired_shape).
However, if I'd like to treat v as a column vector, and copy it to make a matrix with each column being a copy of v, I can't find a simple way to do it. Through trial and error, I'm able to do this:
np.broadcast_to(v.reshape(v.shape[0], 1), desired_shape)
While that works, I can't claim to understand it (even though I wrote it!).
Part of the problem is that numpy doesn't seem to have a concept of a column vector (hence the reshape hack instead of what in math would just be .T).
But, a deeper part of the problem seems to be that broadcasting only works vertically, not horizontally. Or perhaps a more correct way to say it would be: broadcasting only works on the higher dimensions, not the lower dimensions. I'm not even sure if that's correct.
In short, while I understand the concept of broadcasting in general, when I try to use it for particular applications, like copying the col vector to make a matrix, I get lost.
Can you help me understand or improve the readability of this code?
https://en.wikipedia.org/wiki/Transpose - this article on Transpose talks only of transposing a matrix.
https://en.wikipedia.org/wiki/Row_and_column_vectors -
a column vector or column matrix is an m × 1 matrix
a row vector or row matrix is a 1 × m matrix
You can easily create row or column vectors (matrices):
In [464]: np.array([[1],[2],[3]])    # column vector
Out[464]:
array([[1],
       [2],
       [3]])
In [465]: _.shape
Out[465]: (3, 1)
In [466]: np.array([[1,2,3]])    # row vector
Out[466]: array([[1, 2, 3]])
In [467]: _.shape
Out[467]: (1, 3)
But in numpy the basic structure is an array, not a vector or matrix.
[Array in Computer Science] - Generally, a collection of data items that can be selected by indices computed at run-time
A numpy array can have 0 or more dimensions. In contrast, a MATLAB matrix has 2 or more dimensions; originally a 2d matrix was all that MATLAB had.
To talk meaningfully about a transpose you have to have at least 2 dimensions. One may have size one, and map onto a 1d vector, but it is still a matrix, a 2d object.
So adding a dimension to a 1d array, whether done with reshape or [:,None], is NOT a hack. It is a perfectly valid and normal numpy operation.
The basic broadcasting rules are:
a dimension of size 1 can be changed to match the corresponding dimension of the other array
a dimension of size 1 can be added automatically on the left (front) to match the number of dimensions.
In this example, both steps apply: (5,)=>(1,5)=>(3,5)
In [458]: np.broadcast_to(np.arange(5), (3,5))
Out[458]:
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
In this, we have to explicitly add the size one dimension on the right (end):
In [459]: np.broadcast_to(np.arange(5)[:,None], (5,3))
Out[459]:
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
np.broadcast_arrays(np.arange(5)[:,None], np.arange(3)) produces two (5,3) arrays.
np.broadcast_arrays(np.arange(5), np.arange(3)[:,None]) makes (3,5).
np.broadcast_arrays(np.arange(5), np.arange(3)) produces an error because it has no way of determining whether you want (5,3) or (3,5) or something else.
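A quick check of those shapes (a small sketch of the two unambiguous cases):

```python
import numpy as np

# (5,1) against (3,): the (3,) is padded to (1,3), both broadcast to (5,3)
a, b = np.broadcast_arrays(np.arange(5)[:, None], np.arange(3))
print(a.shape, b.shape)  # (5, 3) (5, 3)

# Swapping which argument is the column gives (3,5) instead
c, d = np.broadcast_arrays(np.arange(5), np.arange(3)[:, None])
print(c.shape, d.shape)  # (3, 5) (3, 5)
```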
Broadcasting always adds new dimensions to the left because it'd be ambiguous and bug-prone to try to guess when you want new dimensions on the right. You can make a function to broadcast to the right by reversing the axes, broadcasting, and reversing back:
def broadcast_rightward(arr, shape):
    # Reverse the axes, broadcast against the reversed shape (so the
    # automatic left-padding acts on what were the rightmost axes),
    # then reverse back.
    return np.broadcast_to(arr.T, shape[::-1]).T

How to get real prediction from TensorFlow

I'm really new to TensorFlow and ML in general. I've been reading a lot and searching for days, but haven't really found anything useful, so..
My main problem is that every single tutorial/example uses images/words/etc., and the outcome is just a vector with numbers between 0 and 1 (yeah, that's the probability). Like that beginner tutorial, where they want to identify digits in images: the result set is just a "map" of every single digit's (0-9) probability (kind of). Here comes my problem: I have no idea in advance what the result could be, it could be 1, 2, 999, anything.
So basically:
input: [1, 2, 3, 4, 5]
output: [2, 4, 6, 8, 10]
This really is just a simplified example; I would always have the same number of inputs and outputs. But how can I get the predicted values back from TensorFlow, not just something like [0.1, 0.1, 0.2, 0.2, 0.4]? I'm not really sure how clear my question is; if it's not, please ask.
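Not a TensorFlow answer as such, but the core idea can be sketched without it: for real-valued outputs you do *regression*, i.e. a linear output trained with a squared-error loss, instead of a softmax over classes (in tf.keras terms, that would be a final Dense layer with no activation and a mean-squared-error loss; the exact layer setup here is my assumption about what the tutorials use). A minimal gradient-descent sketch in plain NumPy learning y = 2x:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 4., 6., 8., 10.])

# One weight w, plain gradient descent on the mean squared error
w = 0.0
for _ in range(200):
    grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= 0.01 * grad

print(round(w, 3))  # converges to ~2.0, so predictions are w*x, real values
```

The same loop is what a TensorFlow optimizer performs internally; because the output is linear rather than a softmax, model.predict would return real numbers like [2, 4, 6, 8, 10] instead of class probabilities.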