Getting back the values to sentences from cosine similarity - cosine-similarity

How do I map the values in my buckets, e.g. 10: [10, 28, 40, 50] and 20: [15, 19, 80], back to the original sentences?
The sentences were first converted with a vectorizer and then bucketed by cosine similarity.

Related

How does numpy manage to divide float32 by 2**63?

Here Daniel mentions
... you pick any integer in [0, 2²⁴), and divide it by 2²⁴, then you can recover your original integer by multiplying the result again by 2²⁴. This works with 2²⁴ but not with 2²⁵ or any other larger number.
But when I tried
>>> b = np.divide(1, 2**63, dtype=np.float32)
>>> b*2**63
1.0
It doesn't work for 2⁶⁴, but I'm left wondering why it works for all the exponents from 24 to 63, and whether this is unique to numpy.
In the context that passage is in, it is not saying that an integer value cannot be divided by 2²⁵ or 2⁶³ and then multiplied to restore the original value. It is saying that this will not work to create an unbiased distribution of numbers.
The text leaves some things not explicitly stated, but I suspect it is discussing taking a value of integer type, converting it to IEEE-754 single-precision, and then dividing it. This will not work for factors larger than 2²⁴ because the conversion from integer type to IEEE-754 single-precision will have to round the number.
For example, for 2³², all numbers from 0 to 16,777,215 will convert to themselves with no error, and then dividing by 2³² will produce a unique floating-point number for each. But both 16,777,216 and 16,777,217 will convert to 16,777,216, and then dividing by 2³² will produce the same number for them (1/256). All numbers from 2,147,483,584 to 2,147,483,776 will map to 2,147,483,648, which then produces ½, so that is 193 numbers mapping to one floating-point number. But all the numbers from 2,147,483,777 to 2,147,484,031 map to 2,147,483,904, so this one has 255 numbers mapping to it, while the next value up, 2,147,484,160, receives 257. (The differences are due to the spacing of representable values doubling at 2³¹ and to the round-to-nearest-ties-to-even rule.) At the high end, the 129 numbers from 4,294,967,168 to 4,294,967,296 map to 4,294,967,296, for which dividing produces 1, which is out of the desired half-open interval, [0, 1).
On the other hand, if we use integers from 0 to 16,777,215 (2²⁴−1), there is no rounding, and each result maps from exactly one starting number and stays within the interval.
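The rounding behaviour described here can be checked directly in NumPy. This is only a minimal sketch, assuming the default round-to-nearest conversion to float32; the variable names are illustrative and not taken from the original discussion.

```python
import numpy as np

# Integers below 2**24 survive a round trip through float32 unchanged.
small = np.arange(2**24 - 1000, 2**24, dtype=np.int64)
round_trip_exact = np.array_equal(small.astype(np.float32).astype(np.int64), small)

# 16,777,216 and 16,777,217 convert to the same float32 value,
# so dividing by 2**32 yields the same quotient (1/256) for both.
a = np.float32(16_777_216)
b = np.float32(16_777_217)
same_value = bool(a == b)
quotient = float(a) / 2**32

# At the high end, the integers 4,294,967,168 .. 4,294,967,296 all
# convert to 2**32, which divides to 1.0 and so falls outside [0, 1).
top = np.arange(2**32 - 128, 2**32 + 1, dtype=np.int64).astype(np.float32)
all_round_to_2_32 = bool(np.all(top == np.float32(2**32)))

print(round_trip_exact, same_value, quotient, all_round_to_2_32)
```

If the conversion rounds as IEEE-754 prescribes, all three flags should come out true and the quotient should equal 1/256.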
Note that “significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old word for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic. And the significand of the IEEE-754 single-precision format has 24 bits, not 23. The primary field used to encode the significand has 23 bits, but the exponent field provides another bit.

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40],
...]
Now, I want to find the top 1 (2, 3 etc.) rows from M that are most similar to N. Or, in simple words, the texts most similar to my initial text.
The challenge is, just checking the distances between N and each row from M is not enough, since for example row M4 [4,0,8,20,40] is very different by distance from N, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 can be just 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the 1, 2, 3 etc. texts from M that are most similar to the text in N)?
Generally speaking, the standard technique for comparing bag-of-words vectors (i.e. your arrays) is the cosine similarity measure. This maps your bag of n (here 5) words to an n-dimensional space, where each array is a point (and equivalently a position vector). The most similar vectors are the ones that make the smallest angle with your text N in that space (this automatically takes care of proportional ones, since their angle to N is zero). Here is code for it (assuming M and N are numpy arrays of the shapes introduced in the question):
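import numpy as np
norms = np.linalg.norm(M, axis=1) * np.linalg.norm(N)  # per-row norms of M, not the norm of the whole matrix
cos_sim = M[np.nanargmax(np.dot(M, N) / norms)]  # nanargmax skips all-zero rows, where 0/0 gives NaN (with a warning)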
which gives output [ 4 0 8 20 40] for your inputs.
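To return the top k rows rather than only the best one, the same idea can be wrapped in a small helper. This is a sketch under stated assumptions: `top_k_cosine` is a hypothetical name, and all-zero rows are simply assigned a similarity of 0 so that they sort last.

```python
import numpy as np

def top_k_cosine(M, N, k=1):
    """Return the indices and cosine similarities of the k rows of M closest to N."""
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    # One norm per row of M, multiplied by the norm of N.
    norms = np.linalg.norm(M, axis=1) * np.linalg.norm(N)
    sims = np.zeros(len(M))
    # Divide only where the norm is non-zero; all-zero rows keep similarity 0.
    np.divide(M @ N, norms, out=sims, where=norms > 0)
    order = np.argsort(-sims)[:k]  # largest similarity first
    return order, sims[order]

N = np.array([1, 0, 2, 5, 10])
M = np.array([[1, 0, 2, 0, 5],
              [0, 0, 0, 0, 0],
              [2, 0, 0, 0, 20],
              [4, 0, 8, 20, 40]])

order, sims = top_k_cosine(M, N, k=2)
print(order, sims)
```

For the data from the question, the first index returned should be 3, the row proportional to N, with a similarity of 1 up to rounding.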
You can normalise your row counts to remove the length effect as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual values can then be calculated as the sum of the square difference between N and M per row. The minimum difference (ignoring NaN or inf values obtained if the row sum is 0) is then the most similar.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
[0,0,0,0,0],
[2,0,0,0,20],
[4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# rows whose sum is 0 produce NaN (0/0); set them to infinity so they are never picked as the minimum
similarity[np.isnan(similarity)] = np.inf
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

np.where function

I've got a little problem understanding the where function in numpy.
The ‘times’ array contains the discrete epochs at which GPS measurements exist (rounded to the nearest second).
The ‘locations’ array contains the discrete values of the latitude, longitude and altitude of the satellite interpolated from 10 seconds intervals to 1 second intervals at the ‘times’ epochs.
The ‘tracking’ array contains an array for each epoch in ‘times’ (array within an array). The arrays have 5 columns and 32 rows. The 32 rows correspond to the 32 satellites of the GPS constellation. The 0th row corresponds to the 1st satellite, the 31st to the 32nd. The columns contain the following (in order): is the satellite tracked (0), is L1 locked (1), is L2 locked (2), is L1 unexpectedly lost (3), is L2 unexpectedly lost (4).
We need to find all the unexpected losses and put them in an array so we can plot it on a map.
What we tried to do is:
i = 0
with np.load(r'folderpath\%i.npz' % i) as oneday_data:  # replace folderpath with your directory
    times = oneday_data['times']
    locations = oneday_data['locations']
    tracking = oneday_data['tracking']
A = np.where(tracking[:][:][4] ==1)
This should give us all the positions of the losses. With these indices it is easy to get the right locations. But it keeps returning useless data.
Can someone help us?
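I think the problem is your chained slices: tracking[:] just returns the whole array again, so tracking[:][:][4] is the same as tracking[4], i.e. the fifth epoch rather than the fifth column. Further, having an array of arrays could lead to weird problems (I assume you mean an object array of 2D arrays).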
So I think you need to dstack tracking into a 3D array, then do where on that. If the array is already 3D, then you can skip the dstack part. This will get the places where L2 is unexpectedly lost, which is what you did in your example:
tracking3d = np.dstack(tracking)
A0, A2 = np.where(tracking3d[:, 4, :]==1)
A0 is the position of the 1 along axis 0 (satellite), while A2 is the position of the same 1 along axis 2 (time epoch).
If the values of tracking can only be 0 or 1, you can simplify this by just doing np.where(tracking3d[:, 4, :]).
You can also roll the axes back into the configuration you were using (0: time epoch, 1: satellite, 2: tracking status)
tracking3d = np.rollaxis(np.dstack(tracking), 2, 0)
A0, A1 = np.where(tracking3d[:, :, 4]==1)
If you want to find the locations where L1 or L2 are unexpectedly lost, you can do this:
tracking3d = np.rollaxis(np.dstack(tracking), 2, 0)
A0, A1, _ = np.where(tracking3d[:, :, 3:]==1)
In this case it is the same, except there is a dummy variable _ used for the location along the last axis, since you don't care whether it was lost for L1 or L2 (if you do care, you could just do np.where independently for each axis).
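The difference between the chained slices and a single multi-axis index can be seen on a small synthetic array. The sketch below assumes tracking is already a 3D array of shape (epochs, 32, 5); the data, the sizes, and the names `epoch_idx`, `sat_idx` and `loss_locations` are made up for illustration.

```python
import numpy as np

n_epochs, n_sats = 6, 32

# Synthetic stand-in for the real data: (epoch, satellite, status column).
tracking = np.zeros((n_epochs, n_sats, 5), dtype=int)
tracking[2, 7, 4] = 1   # L2 unexpectedly lost: epoch 2, satellite row 7
tracking[5, 0, 4] = 1   # L2 unexpectedly lost: epoch 5, satellite row 0
tracking[3, 9, 3] = 1   # an L1 loss, which column 4 should not report

# Chained slices: each [:] returns the whole array, so this is just epoch 4.
same_as_epoch_4 = np.array_equal(tracking[:][:][4], tracking[4])

# A single index over all three axes selects column 4 for every epoch and satellite.
epoch_idx, sat_idx = np.where(tracking[:, :, 4] == 1)

# Synthetic (lat, lon, alt) per epoch; the epoch indices select the matching rows.
locations = np.arange(n_epochs * 3, dtype=float).reshape(n_epochs, 3)
loss_locations = locations[epoch_idx]

print(same_as_epoch_4, epoch_idx, sat_idx)
```

Here epoch_idx indexes both times and locations, while sat_idx + 1 would give the satellite number under the row convention described in the question.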

Different document length in computing cosine similarity?

Is there any rule to follow when computing the cosine similarity between two documents that have different numbers of words?
The standard formula does not require the number of words to match. You can just sum over the union of the words of both documents. All words that are in B but not in A give rise to a 0 in the word vector for A. All words that are in A but not in B give rise to a 0 in the word vector for B.
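As an illustration of this, here is a minimal sketch using word counts. The function name `cosine_similarity` and the whitespace tokenisation are assumptions made for the example; a real pipeline would likely tokenise more carefully.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two texts based on raw word counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product;
    # a word missing from one document counts as 0 there.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)

short = "the cat sat"
long_ = "the cat sat the cat sat the cat sat"
print(cosine_similarity(short, long_))          # same proportions, different length
print(cosine_similarity("cat", "dog barks"))    # no shared words
```

The first pair differs in length by a factor of three but has the same word proportions, so the similarity is 1 up to rounding; the second pair shares no words and gives 0.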

Sparse dot product in SQL

Imagine I have a table which stores a series of sparse vectors. A sparse vector means that it stores only the nonzero values explicitly in the data structure. I could have a 1 million dimensional vector, but I only store the values for the dimensions which are nonzero. So the size is proportional to the number of nonzero entries, not the dimensionality of the vector.
Table definition would be something like this:
vector_id : int
dimension : int
value : float
Now, in normal programming land I can compute the inner product or dot product of two vectors in O(|v1| + |v2|) time. Basically the algorithm is to store the sparse vectors sorted by dimension and iterate through the dimensions in each until you find collisions between dimensions and multiply the values of the shared dimension and keep adding those up until you get to the end of either one of the vectors.
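For reference, the merge-style algorithm described above might look like this in Python. It is a sketch that assumes each vector is a list of (dimension, value) pairs already sorted by dimension; `sparse_dot` is a hypothetical name.

```python
def sparse_dot(v1, v2):
    """Dot product of two sparse vectors given as sorted (dimension, value) pairs."""
    i = j = 0
    total = 0.0
    # Walk both lists in step, advancing whichever has the smaller dimension.
    while i < len(v1) and j < len(v2):
        d1, x1 = v1[i]
        d2, x2 = v2[j]
        if d1 == d2:
            total += x1 * x2  # shared dimension: accumulate the product
            i += 1
            j += 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return total

v1 = [(0, 1.0), (3, 2.0), (7, 4.0)]
v2 = [(3, 5.0), (7, 0.5), (9, 3.0)]
print(sparse_dot(v1, v2))  # dimensions 3 and 7 are shared: 2*5 + 4*0.5
```

Each step advances at least one of the two positions, which is where the O(|v1| + |v2|) bound comes from.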
What's the fastest way to pull this off in SQL?
You should be able to replicate this algorithm in one query:
select sum(v1.value * v2.value)
from vectors v1
inner join vectors v2
on v1.dimension = v2.dimension
where v1.vector_id = ...
and v2.vector_id = ...
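The query can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name `vectors` follows the query above, while the composite primary key and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per non-zero entry. The composite primary key
# (an assumption) also gives the join an index to work with.
conn.execute(
    "CREATE TABLE vectors ("
    " vector_id INTEGER, dimension INTEGER, value REAL,"
    " PRIMARY KEY (vector_id, dimension))"
)
rows = [
    (1, 0, 1.0), (1, 3, 2.0), (1, 7, 4.0),   # vector 1
    (2, 3, 5.0), (2, 7, 0.5), (2, 9, 3.0),   # vector 2
]
conn.executemany("INSERT INTO vectors VALUES (?, ?, ?)", rows)

query = """
    SELECT SUM(v1.value * v2.value)
    FROM vectors v1
    INNER JOIN vectors v2 ON v1.dimension = v2.dimension
    WHERE v1.vector_id = ? AND v2.vector_id = ?
"""
(dot,) = conn.execute(query, (1, 2)).fetchone()
print(dot)  # dimensions 3 and 7 are shared: 2*5 + 4*0.5
```

Note that SUM over an empty join returns NULL (None in Python) when the two vectors share no dimension; wrapping it in COALESCE(..., 0) is one way to get 0 instead.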