Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?
The word scores are calculated using cosine similarity from vectors representing each word.
Now that I have individual word scores, is it too naive to sum the individual word scores and divide by the total word count of both sentences to get a score for the two sentences?
I've read about further constructing vectors to represent the sentences, using the word scores, and then again using cosine similarity to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores. Nor am I aware of what the tradeoffs are compared with the naive approach described above, which at the very least, I can easily comprehend. :).
Any insights are greatly appreciated.
Thanks.
What I ended up doing was taking the mean of each sentence's set of word vectors and then applying cosine similarity to the two means, which gives a score for the sentence pair.
I'm not sure how mathematically sound this approach is, but I've seen it done elsewhere (for example, in Python's gensim).
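For illustration, here is a minimal sketch of that mean-then-cosine approach in plain Python. The function names and the toy 3-dimensional vectors are made up for the example; real word vectors would come from whatever embedding model you use.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(vectors_a, vectors_b):
    """Average each sentence's word vectors, then compare the two means."""
    return cosine_similarity(mean_vector(vectors_a), mean_vector(vectors_b))


# Toy word vectors, one inner list per word (assumed values).
sentence_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentence_b = [[1.0, 1.0, 0.0]]
score = sentence_similarity(sentence_a, sentence_b)
print(score)
```

One tradeoff to keep in mind: averaging discards word order, and long sentences tend to drift toward a generic "average" vector, so this works best as a cheap baseline.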
It would be better to use contextual word embeddings (vector representations) for the words.
Here is an approach to sentence similarities by pairwise word similarities: BERTScore.
You can check the math here.
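As a rough sketch of the idea, the greedy matching at the core of BERTScore can be computed directly from a matrix of pairwise word similarities: each word is matched to its most similar word in the other sentence, and the matches are averaged into precision, recall and F1. This simplified version is an assumption-laden illustration; it omits the optional idf weighting and does not compute the contextual embeddings themselves, and the function name and toy scores are invented.

```python
def greedy_match_scores(sim):
    """sim[i][j] is the similarity of reference word i and candidate word j."""
    # Recall: each reference word takes its best match among candidate words.
    recall = sum(max(row) for row in sim) / len(sim)
    # Precision: each candidate word takes its best match among reference words.
    columns = list(zip(*sim))
    precision = sum(max(column) for column in columns) / len(columns)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy similarities: 2 reference words (rows) by 3 candidate words (columns).
pairwise = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
]
precision, recall, f1 = greedy_match_scores(pairwise)
print(precision, recall, f1)
```

Compared with summing all pairwise scores and dividing by the word count, taking the per-word maximum keeps unrelated word pairs from dragging the sentence score down.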
I have a sentence similarity task where I calculate the cosine similarity of two sentence embeddings to decide how similar they are. It seems that for sentences containing digits, the similarity is not affected no matter how "far" apart the numbers are. For example:
a = generate_embedding('issue 845')
b = generate_embedding('issue 11')
cosine_sim(a,b) = 0.9307
Is there a way to distance the hashing of numbers, or any other hack to handle that issue?
If your sentence embeddings are produced from the embeddings of individual words (or tokens), then one hack would be the following:
Add extra dimensions to the word embeddings. These dimensions would be set to zero for all non-numeric tokens, and for numeric tokens they would contain values reflecting the magnitude of the numeric value. It gets a bit mathematical because cosine similarity depends on angles, so the extra dimensions would have to encode the magnitude of the numeric values as larger or smaller angles.
An easier workaround would be to extract the numeric values from the sentences using regular expressions, compute their distance, and combine that information with the similarity score to obtain a new similarity score.
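A minimal sketch of the regular-expression workaround might look like this. The closeness formula, the 0.5 weight, and all function names are assumptions chosen for illustration, not a standard recipe; the embedding similarity is passed in as a plain number so the example does not depend on any particular embedding model.

```python
import re

# Matches integers and simple decimals such as "845" or "3.14".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in the text, in order, as floats."""
    return [float(match) for match in NUMBER_PATTERN.findall(text)]


def numeric_closeness(nums_a, nums_b):
    """1.0 when the numbers agree, approaching 0.0 as they diverge."""
    if not nums_a and not nums_b:
        return 1.0  # nothing numeric to compare
    if len(nums_a) != len(nums_b):
        return 0.0  # assumed penalty when the counts differ
    ratios = []
    for a, b in zip(nums_a, nums_b):
        largest = max(abs(a), abs(b))
        ratios.append(1.0 if largest == 0 else 1.0 - abs(a - b) / largest)
    return sum(ratios) / len(ratios)


def adjusted_similarity(text_a, text_b, embedding_similarity, weight=0.5):
    """Blend the embedding similarity with the numeric closeness."""
    closeness = numeric_closeness(extract_numbers(text_a), extract_numbers(text_b))
    return (1 - weight) * embedding_similarity + weight * closeness


# 0.9307 is the cosine similarity reported in the question.
adjusted = adjusted_similarity("issue 845", "issue 11", 0.9307)
print(adjusted)
```

With these assumptions, the score for 'issue 845' versus 'issue 11' drops well below the original 0.9307, while two sentences with identical numbers keep a high score. The weight controls how strongly the numbers influence the result.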
I'm quite new to text mining and I'm challenging myself to do sentiment analysis today, but I've run into some problems.
In my language, a word can have several different meanings. For example, "setan" means: 1) devil 2) a curse word. How do I resolve this ambiguity in sentiment analysis?
Also, for everyone's information, the algorithm I'm using is a naive Bayes classifier, and the tool I'm using is RapidMiner.
I need your help. Any tips would be great. Thank you!
Training a Naive Bayes classifier on your data makes the model estimate, for each word, a probability under every class you are trying to predict. In your case, since it's sentiment analysis with Positive and Negative as the two classes, you would have one probability for setan under Positive and another under Negative.
With this in mind, if a word has multiple meanings that could carry both positive and negative sentiment, make sure to include both kinds of instances in your training data, so that the learned probabilities reflect both usages when new text is classified as Positive or Negative.
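To make the per-word probabilities concrete, here is a small hand-rolled multinomial Naive Bayes sketch with add-one (Laplace) smoothing. The training documents are invented toy data and the function names are hypothetical; RapidMiner's operator will differ in its details.

```python
import math
from collections import Counter


def train(labelled_docs):
    """Count words per class from (tokens, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def word_probability(word, label, word_counts, vocab):
    """P(word | label) with add-one smoothing."""
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))


def classify(tokens, word_counts, class_counts, vocab):
    """Pick the class with the highest log prior plus log likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        for token in tokens:
            if token in vocab:  # words never seen in training are skipped
                score += math.log(word_probability(token, label, word_counts, vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
docs = [
    (["good", "film"], "positive"),
    (["great", "film"], "positive"),
    (["setan", "bad"], "negative"),
    (["setan", "film", "awful"], "negative"),
]
word_counts, class_counts, vocab = train(docs)
p_pos = word_probability("setan", "positive", word_counts, vocab)
p_neg = word_probability("setan", "negative", word_counts, vocab)
print(p_pos, p_neg, classify(["setan"], word_counts, class_counts, vocab))
```

Because setan only appears in negative examples here, its probability under the negative class is higher. If your data contained positive uses as well, the two probabilities would move closer together.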
In your case, it seems that both meanings of setan have a negative connotation, which really shouldn't be a problem. Words like "the" and "a", which appear in both Positive and Negative instances and are commonly called stopwords, should be removed since they contribute little to the classification.
If you are trying to train the model on the word's specific meanings, you can refer to this paper: https://pdfs.semanticscholar.org/fc01/b42df3077a512620456d8a2714951eccbd67.pdf.
I ask because I'd like to use it to process the text input for my LSTM.
Any feedback would be much appreciated.
As the name suggests, it is "word" to vector: it represents words as vectors, placing similar words close together in the vector space.
For example, 'cat' and 'kitten' have similar meanings, so they will be close to each other, i.e., their vector representations will be similar, whereas the vector representation of 'man' will be quite far from them in the same space.
Here is a beautiful blog post that talks about word2vec in detail.
I'm using the most_similar() method as below to get all the words similar to a given word:
results = model.most_similar('apple', topn=sizeofdict)  # list of (word, score) pairs
AFAIK, this calculates the cosine similarity between the given word and all the other words in the dictionary. When I inspect the words and scores, I can see words with negative scores further down the list. What does this mean? Are they words that have the opposite meaning to the given word?
Also, if it's using cosine similarity, how does it get a negative value? Cosine similarity varies between 0 and 1 for two documents.
Yes, it does calculate the cosine similarity between the given word and all the other words in the vocabulary.
No, a negative score doesn't mean the two words have opposite meanings. The word2vec training objective is built on dot products between word vectors, which are closely related to cosine similarity. Training reduces the angle between the vectors of similar words, so similar words cluster together on the high-dimensional sphere. As a rough rule of thumb, a cosine similarity above 0.6 between word vectors suggests the words are similar in meaning.
No, the cosine similarity between two vectors lies between -1 and 1. A similarity in [0, 1] implies an angle between 0 and 90 degrees. A negative similarity implies an angle between 90 and 180 degrees. The 0 to 1 range only holds when all vector components are non-negative, as with tf-idf document vectors.
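A quick check with toy 2-dimensional vectors (values chosen only for illustration) shows how the sign of the cosine follows the angle:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # angle of 90 degrees
obtuse = cosine_similarity([1.0, 0.0], [-1.0, 1.0])          # angle of 135 degrees
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # angle of 180 degrees
print(same_direction, orthogonal, obtuse, opposite)
```

Word2vec vectors have both positive and negative components, so obtuse angles, and therefore negative similarities, are expected for unrelated words.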
I understand the concepts of VSM, TFIDF and cosine similarity; however, after reading the Lucene website, I am still confused about how Lucene builds the VSM and calculates the similarity for each query.
As I understood, VSM is a matrix where the values of TFIDF of each term are filled. When I tried building a VSM from a set of documents with this tool, it took a long time: http://sourceforge.net/projects/wvtool/
This is not really related to the coding, because intuitively, building a VSM matrix for a large data set is time consuming, but that does not seem to be the case for Lucene.
In addition, even with a prebuilt VSM, finding the most similar document, which is basically computing the similarity between two documents or between a query and a document, is often time consuming (assume millions of documents, because one has to compute similarity to everyone else), yet Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me understand two points here:
1. How does Lucene build the VSM used for calculating similarity so fast?
2. How is Lucene's similarity calculation among millions of documents so fast?
I'd appreciate it if a real example were given.
Thanks
"As I understood, VSM is a matrix where the values of TFIDF of each term are filled."
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix, and the notion of cosine similarity arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So, the rows of the term-document matrix are represented in the index, which is a hash table mapping terms to (document, tf) pairs plus a separate table mapping terms to their df value.
"one has to compute similarity to everyone else"
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
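To illustrate the idea (not Lucene's actual implementation or scoring formula), here is a toy inverted index in Python. The index maps each term to its (document, tf) postings, and the search only touches documents that share at least one term with the query. The idf formula, function names and sample documents are all assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build postings (term -> [(doc_id, tf)]), idf values and document norms."""
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    num_docs = len(docs)
    # Assumed idf variant; the document frequency is the postings list length.
    idf = {term: math.log(num_docs / len(plist)) + 1.0 for term, plist in postings.items()}
    squared = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist:
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}
    return postings, idf, norms


def search(query_tokens, postings, idf, norms):
    """Cosine-score only the documents that contain some query term."""
    query_weights = {
        term: tf * idf[term]
        for term, tf in Counter(query_tokens).items()
        if term in idf  # query terms absent from the index cannot match
    }
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = defaultdict(float)
    for term, query_weight in query_weights.items():
        for doc_id, tf in postings[term]:  # lookup instead of a full scan
            scores[doc_id] += query_weight * tf * idf[term]
    ranked = [(doc_id, dot / (query_norm * norms[doc_id])) for doc_id, dot in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Invented sample collection.
docs = {
    "d1": ["lucene", "index", "search"],
    "d2": ["cosine", "similarity", "search"],
    "d3": ["cooking", "recipes"],
}
postings, idf, norms = build_index(docs)
results = search(["lucene", "search"], postings, idf, norms)
print(results)
```

Document d3 shares no term with the query, so it is never visited or scored. With millions of documents, most of them fall into that category for any given query, which is where the speed comes from.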