I have a sentence similarity task where I calculate the cosine similarity of two sentence embeddings to decide how similar they are. It seems that for sentences containing digits, the similarity is not affected no matter how "far apart" the numbers are. For example:
a = generate_embedding('issue 845')
b = generate_embedding('issue 11')
cosine_sim(a,b) = 0.9307
Is there a way to make the embeddings of numbers reflect their distance, or any other hack to handle this issue?
If your sentence embeddings are produced from the embeddings of individual words (or tokens), then one hack would be to add extra dimensions to the word embeddings. These dimensions would be set to zero for all non-numeric tokens, and for numeric tokens they would contain values reflecting the magnitude of the numeric value. It gets a bit mathematical, because cosine similarity works with angles, so the extra dimensions would have to reflect the magnitude of the numeric values through larger or smaller angles.
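As a rough, illustrative sketch of that idea (the log scaling, the extra-dimension weight, and the isdigit test are all arbitrary choices, not a tested recipe):

import numpy as np

def augment_with_magnitude(token, vec, weight=1.0):
    """Append one extra dimension to a token embedding.

    Non-numeric tokens get 0 in the extra dimension; numeric tokens get a
    (log-scaled) magnitude, so the vectors of very different numbers end up
    pointing in noticeably different directions. `weight` controls how much
    the numeric dimension bends the angle.
    """
    extra = weight * np.log1p(float(token)) if token.isdigit() else 0.0
    return np.concatenate([vec, [extra]])

# augment_with_magnitude('845', vec_845) and augment_with_magnitude('11', vec_11)
# now differ in the last dimension, which lowers their cosine similarity.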
An easier (workaround) hack would be to extract the numeric values from the sentences using regular expressions, compute their distance, and combine that information with the embedding similarity score to obtain a new similarity score.
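A minimal sketch of that workaround (the regex, the normalisation, and the weight alpha are all arbitrary choices for illustration):

import re

def numeric_penalty(s1, s2):
    # Extract the numbers from each sentence and compare the first ones found.
    n1 = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", s1)]
    n2 = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", s2)]
    if not n1 or not n2:
        return 0.0
    diff = abs(n1[0] - n2[0])
    return diff / (1.0 + diff)   # squash the difference into [0, 1)

def combined_similarity(cos_sim, s1, s2, alpha=0.5):
    # Blend the embedding similarity with the numeric distance.
    return cos_sim - alpha * numeric_penalty(s1, s2)

# combined_similarity(0.9307, 'issue 845', 'issue 11') drops to roughly 0.43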
I wanted to implement a classification algorithm using a NN, but some columns have complex alphanumeric strings, so for now I chose only the simpler columns to check. Here is an example with a few elements of the columns I chose...
A few elements of the column:
As you can see, these columns contain A, G, C, or T, etc. Some entries had combinations of the four letters, but I removed those for now. My plan was to map each of these letters to values like 1, 2, 3, and 4 and then feed them to the NN.
Is this mapping acceptable for feeding into a dense NN, or is there a better method for doing this?
I would not map them to integers like 1, 2, 3, etc., because you would be giving them an artificial order or rank that the NN may treat as meaningful, although no such ranking truly exists.
If you do not have high cardinality (many unique values), then you can apply one-hot encoding. If the cardinality is high, you should use other encoding techniques; otherwise a one-hot encoder will introduce a lot of dimensionality and sparsity into your data, which is not welcome. You can find some other interesting methods for encoding categorical variables here.
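For example, with pandas (the column name and values below are made up), one-hot encoding the letters looks like this:

import pandas as pd

# Hypothetical column of nucleotide letters
df = pd.DataFrame({"base": ["A", "G", "C", "T", "A", "C"]})

# Each letter becomes its own 0/1 column, so no artificial order is implied.
encoded = pd.get_dummies(df["base"], prefix="base")
print(encoded.head())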
I am using WEKA to work with a text collection. Suppose I have n documents with text. I calculated TF-IDF as the feature vector for each document and then calculated the cosine similarity between each pair of documents, which produced an n×n matrix. Now I wonder how to use this n×n matrix in the k-means algorithm. I know I can apply some dimension reduction such as MDS or PCA. What I am confused about is how, after applying dimension reduction, I will identify each document. For example, if I have 3 documents d1, d2, d3, then cosine similarity will give me the pairwise distances
d11, d12, d13
d21, d22, d23
d31, d32, d33
Now I am not sure what the output will be after PCA or MDS, and how I will identify the documents after k-means. Please suggest; I hope I have put my question clearly.
PCA is used on the raw data, not on distances, i.e. PCA(X).
MDS uses a distance function, i.e. MDS(X, cosine).
You appear to believe you need to run PCA(cosine(X)); that doesn't work.
You want to run MDS(X, cosine).
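The question is about WEKA, but as an illustration, here is roughly what that pipeline looks like in Python with scikit-learn (the toy documents are made up); the row order is preserved throughout, which is how you keep track of which document is which:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat", "the dog sat on the log", "stock prices fell sharply"]

X = TfidfVectorizer().fit_transform(docs)      # raw TF-IDF features, one row per document
D = cosine_distances(X)                        # n x n cosine distance matrix
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
# coords[i] and labels[i] still correspond to docs[i], so each document
# remains identifiable after the reduction and the clustering.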
I'm using the most_similar() method as below to get all the words similar to a given word:
similar_words = model.most_similar('apple', topn=sizeofdict)  # list of (word, score) pairs
AFAIK, what this does is calculate the cosine similarity between the given word and all the other words in the vocabulary. When I inspect the words and scores, I can see there are words with negative scores down the list. What does this mean? Are they words that have the opposite meaning to the given word?
Also, if it's using cosine similarity, how does it get a negative value? Cosine similarity varies between 0 and 1 for two documents.
Yes, it does calculate the cosine similarity between the given word and all the other words in the vocabulary.
No, a negative score doesn't mean the two words have opposite meanings. Cosine similarity is part of the cost function used in training the word2vec model. The model reduces the angle between the vectors of similar words, so that similar words are clustered together on the high-dimensional sphere. Typically, for word vectors, a cosine similarity > 0.6 means they are similar in meaning.
No, the cosine similarity between two vectors lies between -1 and 1. A similarity in [0, 1] corresponds to an angle between 0 and 90 degrees; a negative similarity corresponds to an angle between 90 and 180 degrees.
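To see how a negative value arises, here is a small numpy illustration (the vectors are made up):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.2])
v = np.array([-1.0, 0.3])   # points in roughly the opposite direction
print(cosine(u, v))         # about -0.88: the angle between u and v is greater than 90 degrees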
Is there any available DM script that can compare two images and quantify the difference?
I mean a script that can compare two or more images and determine their similarity; for example, if 95% of the area of one image is the same as in another image, then the similarity of these two images is 95%.
The script should also be able to compare the brightness and contrast distributions of the images.
Thanks,
This question is a bit ill-defined, as "similarity" between images depends a lot on what you want.
If by "95% of the area is the same" you mean that 95% of the pixels are of identical value in images A & B, you can simply create a mask and sum() it to count the number of pixels, i.e.:
sum( abs(A-B)==0 ? 1 : 0 )
However, this will utterly fail if the images A & B are shifted with respect to each other even by a single pixel. It will also fail if A & B have the same contrast but different absolute values.
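Outside of DM script, the same fraction-of-identical-pixels idea looks like this in numpy (keeping the shift and contrast caveats above in mind):

import numpy as np

A = np.random.randint(0, 256, (64, 64))
B = A.copy()
B[:3, :] += 1                          # perturb a few rows so the images are not identical

identical_fraction = np.mean(A == B)   # e.g. 0.95 would mean "95% of the area is the same"
print(identical_fraction)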
I guess the intent of the question is to find the similarity of two images in a fuzzy way.
For that, one way is cross-correlation, and DM has a built-in function for it:
image xcorr = CrossCorrelate(ref, img)
From xcorr, the peak position gives the x- and y-shift between the two images, and the peak intensity gives the "similarity" of the two.
If you know there is no shift between the two, you can just use the sum of the pixel-wise product:
number similarity1 = sum(img1 * img2)
Another way to measure similarity is to calculate the Euclidean distance between the two:
number similarity2 = sqrt(sum((img1 - img2)**2))
"similarity2" calculates the "pure" similarity. "similarity1" is the pure similarity plus the mean intensity of img1 and img2. The difference is essentially this,
(a-b)**2=a**2+b**2-2*a*b.
The left term is "similarity2", the last term on the right is the "crosscorrelation" or "similarity1".
I think "similarity1" is called cross-correlation, "similarity2" is called correlation coefficient.
In example comparing two diffraction patterns, if you want to compute the degree of similarity, use "similarity2". If you want to compute the degree of similarity plus a certain character of the diffraction pattern, use "similarity1".
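A small numpy sketch (outside of DM script) of how the two quantities are linked through that identity:

import numpy as np

# Two made-up images, just to illustrate the maths.
img1 = np.random.rand(64, 64)
img2 = np.random.rand(64, 64)

similarity1 = np.sum(img1 * img2)                  # zero-shift cross-correlation term
similarity2 = np.sqrt(np.sum((img1 - img2) ** 2))  # Euclidean distance between the images

# (a-b)**2 = a**2 + b**2 - 2*a*b, summed over all pixels:
lhs = similarity2 ** 2
rhs = np.sum(img1 ** 2) + np.sum(img2 ** 2) - 2 * similarity1
assert np.isclose(lhs, rhs)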
Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?
The word scores are calculated using cosine similarity from vectors representing each word.
Now that I have individual word scores, is it too naive to sum the individual word scores and divide by the total word count of both sentences to get a score for the two sentences?
I've read about further constructing vectors to represent the sentences, using the word scores, and then again using cosine similarity to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores, nor am I aware of what the tradeoffs are compared with the naive approach described above, which, at the very least, I can easily comprehend. :)
Any insights are greatly appreciated.
Thanks.
What I ended up doing was taking the mean of each sentence's set of word vectors and then applying cosine similarity to the two means, resulting in a score for the sentences.
I'm not sure how mathematically sound this approach is, but I've seen it done in other places (like Python's gensim).
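For reference, a minimal numpy sketch of that mean-then-cosine approach, assuming wv maps each word to its vector (for example a gensim KeyedVectors object or a plain dict):

import numpy as np

def sentence_vector(words, wv):
    # Average the vectors of the words that the model actually knows.
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# sim = cosine(sentence_vector(sent1.split(), wv), sentence_vector(sent2.split(), wv))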
It would be better to use contextual word embeddings (vector representations) for the words.
Here is an approach to sentence similarity based on pairwise word similarities: BERTScore.
You can check the math here.
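If you want to try it, the bert-score package exposes a simple scoring interface; the example sentences below are made up:

# pip install bert-score
from bert_score import score

cands = ["the weather is cold today"]
refs = ["it is freezing today"]

# P, R, F1 each contain one value per candidate/reference pair.
P, R, F1 = score(cands, refs, lang="en")
print(F1[0].item())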