Pairwise cosine similarity - cosine-similarity

I'm a little confused after reading this paper: Pairwise Document Similarity in Large Collections with MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
In this paper, the authors do not seem to consider words that appear in only one document. But according to the definition of cosine similarity, we need to account for them, right?
The material I used is this: https://www.dropbox.com/s/nctb66hh84ab32c/postings-Reuters-data
The java code I used is this: https://www.dropbox.com/s/aklviixup4uulmu/CosineSimilarity.java
And the results I generated are here: https://www.dropbox.com/s/ea6ov7l7yut7yfj/part-00000
In the results, I see a lot of 1's and even numbers bigger than 1, which looks wrong for a cosine similarity. Could someone help me find the reason? Thanks.
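To make the definition in question concrete, here is a minimal sketch of cosine similarity over sparse term-weight vectors. The documents and weights are made up for illustration, and this is not the linked CosineSimilarity.java. A term that appears in only one document adds nothing to the dot product, but it still counts toward that document's norm. One possible explanation for values above 1 (an assumption, since the linked code is not reproduced here) is that the weights are summed without length normalization.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    # Only terms present in BOTH documents contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    # Every term, shared or not, contributes to its own document's norm.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term weights, for illustration only.
doc1 = {"oil": 2.0, "price": 1.0, "opec": 3.0}   # "opec" occurs only in doc1
doc2 = {"oil": 1.0, "price": 2.0, "wheat": 4.0}  # "wheat" occurs only in doc2

sim = cosine_similarity(doc1, doc2)

# Same dot product, but without dividing by the norms: not bounded by 1.
unnormalized = sum(w * doc2[t] for t, w in doc1.items() if t in doc2)

print(sim, unnormalized)
```

With these weights the normalized value stays below 1, while the unnormalized sum is 4.0.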

Related

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, so please bear with the naive questions.
I have generated Doc2Vec vectors, i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use model.similar for doc1 and get the output that doc5 and doc10 are similar to doc1.
Q1) How can I summarize, in code, the important words or the high-level summary that a document holds?
In addition, if I take the array output and run K-means to get 5 clusters, how do I define what each cluster represents?
Q2) I could read the documents, but the number of documents is very high, so reading them manually to find the cluster definitions is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort of the opposite of doc-vector inference. It could take a doc-vector (perhaps one corresponding to an existing document), provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative-sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N "most-associated" words with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)
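As a rough illustration of the "run the model forward" idea above, here is a toy sketch in plain Python. It is not gensim code: the vocabulary, the output weight matrix, and the doc-vector are all hypothetical stand-ins for what a trained model would hold, and the softmax-style scaling is only loosely modeled on what predict_output_word() does.

```python
import math

def top_words(doc_vector, output_weights, vocab, topn=3):
    """Rank vocabulary words by their output-node activation for a doc-vector."""
    # One activation per vocabulary word: the dot product of the doc-vector
    # with that word's row of output-layer weights.
    scores = [sum(d * w for d, w in zip(doc_vector, row)) for row in output_weights]
    # Scale the activations into a probability-like distribution (softmax).
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort words by descending probability and keep the top N.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

# Hypothetical 2-dimensional model with a 4-word vocabulary.
vocab = ["cat", "dog", "stock", "market"]
output_weights = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
doc_vector = [1.0, 0.0]

result = top_words(doc_vector, output_weights, vocab, topn=2)
print(result)
```

In a real model the weights would come from the trained output layer rather than from a hand-written list.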

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less emphasis, which gives more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: depending on the size of the corpus, your data may have too many dimensions (as many dimensions as there are unique words). You could use stemming/lemmatization to mitigate this problem to some degree.
You can then calculate the similarity between two TF-IDF vectors using, for example, cosine similarity.
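The TF-IDF plus cosine similarity approach above can be sketched in a few lines of plain Python. The corpus is made up, tokenization is a naive whitespace split, and the smoothed IDF formula is just one common variant chosen for illustration, so treat this as a sketch rather than a reference implementation.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (dict) per text."""
    docs = [text.lower().split() for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): terms found everywhere get the lowest weight.
    idf = {t: math.log((1 + n_docs) / (1 + count)) + 1.0 for t, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [
    "the cat and a dog",
    "a dog and the cat",
    "the bird and a fish",
    "a car and the road",
]
vecs = tfidf_vectors(corpus)

same_words = cosine(vecs[0], vecs[1])       # same words, different order
different_words = cosine(vecs[0], vecs[2])  # only "the", "and", "a" in common

print(same_words, different_words)
```

Because word order is ignored, the first two sentences score (up to rounding) 1.0, while the third scores much lower since it shares only the low-weight words.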
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Range of possible values for alpha, gamma and eta params of HLDA's Mallet implementation

I'm trying to run the hLDA algorithm to produce a descriptive hierarchy of the input documents. The problem is that I'm running various parameter configurations and trying to understand how it works in an "empirical way", because I cannot match the parameters to the ones used in the original papers (I understand it's a different team). E.g. alpha in Mallet seems to be eta in the paper, but I'm not sure. Besides, I don't know the bounds for each of them, i.e. the range of possible values for each parameter.
In the source code, there is some help:
double alpha; // smoothing on topic distributions
double gamma; // "imaginary" customers at the next
double eta; // smoothing on word distributions.
First, I used the default values: alpha=10.0; gamma=1.0; eta = 0.1;
Then I tried running the algorithm with different values and interpreting the results, but I can't understand what they mean. E.g. I think changing gamma (in Mallet) affects the customer's decision: to start a new node in the tree or to be placed in an existing one. So if I set gamma = 0.5, fewer nodes should be produced, because 0.5 is half the probability of the default one, right? But the results with gamma=1 give me 87 nodes, and with gamma=0.5 I get 98! And then I'm asking myself something new: is that a probability at all? I tried to find the range of possible values in these two papers, but I didn't find it:
Hierarchical Topic Models and the Nested Chinese Restaurant Process
The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies
I know I could be missing something, because I don't have a good background in this, but that's why I'm asking here. Maybe someone has already had this problem and can help me understand those limits.
Thanks in advance!
It may be helpful to run multiple times with each hyperparameter setting. I suspect that gamma does not have a big influence on the final number of topics, and that what you are seeing could just be typical variability in the sampling process.
In my experience the parameter that has by far the strongest influence on the number of topics is actually eta, the topic-word smoothing.
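On the "is gamma a probability?" part of the question: in the standard Chinese restaurant process, on which the nested CRP in the cited papers builds, gamma is a positive concentration parameter (a count of "imaginary" customers), not a probability. A small sketch of the seating rule follows. The table counts are made up, and whether Mallet's code follows exactly this form is an assumption here.

```python
def crp_seat_probabilities(table_counts, gamma):
    """Seating probabilities for the next customer in a standard CRP.

    table_counts: number of customers already at each existing table.
    gamma: positive concentration parameter (not itself a probability).
    """
    n = sum(table_counts)
    # An existing table is chosen in proportion to its current occupancy.
    existing = [count / (n + gamma) for count in table_counts]
    # A new table is chosen in proportion to gamma.
    new = gamma / (n + gamma)
    return existing, new

# Hypothetical occupancy: 10 customers spread over 3 tables.
counts = [5, 3, 2]

_, p_new_default = crp_seat_probabilities(counts, gamma=1.0)   # 1/11
_, p_new_half = crp_seat_probabilities(counts, gamma=0.5)      # 0.5/10.5
_, p_new_large = crp_seat_probabilities(counts, gamma=10.0)    # 10/20

print(p_new_default, p_new_half, p_new_large)
```

Under this rule any gamma > 0 is valid, including values above 1, and the chance of opening a new table shrinks as more customers are seated, whatever gamma is.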

Fourier Transformation -

I've been doing a lot of research on this topic and I'm finally getting somewhere. Below are two complex numbers from the Java code I'm using:
-9771.0 - j2125.0
-16184.09634718744 - j53968.71008512241
I know the amplitude/magnitude can be computed as sqrt(a^2 + b^2), and this is as far as I've gotten. I've read about sample rate, but I need a better explanation of it and would like to be pointed in the right direction. I've plotted the power spectrum graph, but I need to do this on paper so that I know how to obtain the frequency.
Applying a Fourier transform to two values is pretty meaningless. You apply it to a series of values (a signal); only then does frequency start to make sense. You can't speak about frequency in a series of two values.
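To connect magnitude, sample rate and frequency: for an N-point DFT of a signal sampled at fs samples per second, bin k corresponds to the frequency k * fs / N. Here is a small sketch using a naive DFT on a made-up 2 Hz sine wave. The sample rate and signal are assumptions for illustration and are unrelated to the two numbers in the question, apart from the magnitude computed at the end.

```python
import cmath
import math

def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]

def bin_frequency(k, sample_rate, n):
    """Frequency in Hz represented by DFT bin k."""
    return k * sample_rate / n

sample_rate = 8.0  # samples per second (hypothetical)
n = 8              # number of samples
# One second of a 2 Hz sine wave sampled at 8 Hz.
signal = [math.sin(2 * math.pi * 2.0 * i / sample_rate) for i in range(n)]

spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum]  # abs() of a complex is sqrt(a^2 + b^2)

# For a real signal only the first half of the bins is independent.
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
peak_hz = bin_frequency(peak_bin, sample_rate, n)
print(peak_bin, peak_hz)

# Magnitude of the first complex number from the question.
magnitude = abs(complex(-9771.0, -2125.0))
print(magnitude)  # about 9999.40
```

The strongest bin is the one whose frequency matches the sine wave, which is how a frequency is read off a power spectrum once the sample rate and the number of samples are known.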

Lucene: Setting minimum required similarity on searches

I'm having a lot of trouble dealing with Lucene's similarity factor. I want it to apply a similarity factor different than its default (which is 0.5 according to documentation), but it doesn't seem to be working.
When I type a query that explicitly sets the required similarity factor, like [tinberland~0.5] (notice that I wrote tiNberland, with an "N", while the correct spelling has an "M"), it brings up many products by the Timberland manufacturer. But when I just type [tinberland] (no similarity factor explicitly defined) and try to set the similarity via code, it doesn't work (it returns no results).
The code I wrote to set the similarity is like:
multiFieldQueryParser.SetFuzzyMinSim(0.5F);
And I didn't change the Similarity algorithm, so it is using the DefaultSimilarity class.
Isn't that the correct or recommended way of applying similarity via code? Is there a specific QueryParser for fuzzy queries?
Any help is highly appreciated.
Thanks in advance!
What you are setting is the minimal similarity, so e.g. if someone searched for foo~.1 the parser would change it to foo~.5. It's not saying "turn every query into a fuzzy query."
You can use MultiFieldQueryParser.getFuzzyQuery like so:
Query q = parser.getFuzzyQuery(field, term, minSimilarity);
but that will of course require you to call getFuzzyQuery for each field. I'm not aware of a "MultiFieldFuzzyQueryParser" class, but all it would do is combine a bunch of those getFuzzyQuery calls.