Interpret the Doc2Vec Vectors Clusters Representation - text-mining

I am new to Doc2Vec, so please bear with the naive questions.
I have generated Doc2Vec vectors, i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use model.similar for doc1 and get the output: doc5 and doc10 are similar to doc1.
Q1) How can I summarize, in code, the important words or the high-level summary that a document holds?
In addition, if I take the array output and run K-means to get 5 clusters, how do I define what each cluster represents?
Q2) I can read the documents, but the number of documents is very high, so reading them manually to work out the cluster definitions is not feasible.

There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort of the opposite of doc-vector inference. It could take a doc-vector (perhaps one corresponding to an existing document), provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative-sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N "most-associated" words with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
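To make the "run the model forward" idea concrete, here is a minimal sketch in plain Python. It is not gensim code and not an existing API: the function name, the toy vocabulary, and the toy weights are all made up for illustration. It assumes you can obtain one output-weight row per vocabulary word; judging from the linked predict_output_word() source, in gensim's negative-sampling models those rows appear to live in an array called syn1neg, but treat that mapping as an assumption to verify. It also ignores any context-word averaging a PV-DM model would do, so it is closest in spirit to PV-DBOW.

```python
import math

def top_words_for_doc_vector(doc_vec, output_weights, vocab, topn=5):
    """Rank vocabulary words by softmax-normalized output activation.

    doc_vec        : list of floats (the doc-vector, or a cluster centroid)
    output_weights : one list of floats per vocabulary word (assumed layout)
    vocab          : list of words, aligned row-by-row with output_weights
    """
    # Raw activation of each output node: dot(doc_vec, that node's weights).
    scores = [sum(d * w for d, w in zip(doc_vec, row)) for row in output_weights]
    # Softmax, subtracting the max first for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    ranked = sorted(zip(vocab, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

# Toy data, invented purely for illustration.
vocab = ["graph", "edge", "protein", "sequence"]
output_weights = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
doc_vec = [0.2, 2.0]

top = top_words_for_doc_vector(doc_vec, output_weights, vocab, topn=2)
print(top)
```

For the K-means part of the question, the same function could in principle be fed a cluster centroid instead of a single doc-vector, giving a rough word list per cluster. Whether those words are actually meaningful would need checking against a sample of documents from each cluster.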
(I'm not sure what your Q2 question actually is.)

Implementations of (fully) dynamic connectivity data structures

The dynamic connectivity problem for graphs consists in maintaining a graph data structure that allows for adding and deleting edges of the graph.
Moreover, the data structure should support connectivity queries.
Typically, such a query is of the form "Are the nodes u and v connected in the graph?"
There are variants of the dynamic connectivity problem that also support different connectivity queries like 2-edge-connectivity or biconnectivity.
My question is: Are there existing efficient implementations of dynamic connectivity data structures?
By efficient I mean data structures with low amortized operation costs.
In particular, I am NOT interested in trivial implementations with a complexity of O(n) per operation!
Below I describe in more detail what I am looking for and what I already know.
If only edge insertions are allowed, the dynamic connectivity problem can be solved by the well-known disjoint-set (aka union-find) data structure.
For this data structure there are implementations available in many different programming languages.
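For reference, the incremental case can be sketched in a few lines. This is a standard disjoint-set with union by size and path halving, written here only as an illustration; the class and method names are my own choice.

```python
class DisjointSet:
    """Incremental connectivity: supports edge insertion and connectivity queries."""

    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.size = [1] * n           # component size, valid at roots only

    def find(self, x):
        # Path halving: point every other node on the path at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Insert edge (a, b). Returns False if a and b were already connected."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra           # attach the smaller tree under the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

    def connected(self, a, b):
        return self.find(a) == self.find(b)

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(3, 4)
print(ds.connected(0, 1), ds.connected(1, 3))
```

Note that there is no way to undo a union here, which is exactly why this structure does not extend to edge deletions.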
Unfortunately, this does not seem to be the case for the dynamic connectivity problem that also allows edge deletions.
The situation is even worse for data structures that also allow other connectivity queries like 2-edge- or biconnectivity.
To the best of my knowledge the algorithms presented in Holm et al. (2001) are still state of the art for many dynamic connectivity problems.
This publication was accompanied by an experimental study, however, as far as I can tell the code was never made publicly available. Also, therein only implementations for the regular connectivity problem are discussed, not for 2-edge- or biconnectivity.
The algorithms by Holm et al. (and also by other authors) are highly non-trivial.
Even though the algorithms are described in much detail it requires a lot of expertise to implement these algorithms in practice.
Because of this I am looking for existing implementations of different dynamic connectivity data structures.
The table below summarizes the (currently underwhelming) implementations of different combinations of supported manipulations and queries.
Graph Manipulations               | Connectivity  | 2-edge-connectivity | Biconnectivity
incremental (adding edges)        | disjoint-set  |                     |
decremental (deleting edges)      | Rafael Glikis |                     |
fully (adding and deleting edges) |               |                     |
I have searched for implementations in different places. I have looked on GitHub, I have followed the external links in the relevant Wikipedia articles, and I have skimmed through a lot of literature, all without success.
I expect we will need a framework for trying things out so that we can discuss this in concrete terms.
I have implemented a small Windows application that accepts user queries to read, build, edit and query the connectivity of a graph, showing the time taken to execute each query.
Sample run:
Supported queries
add v1 v2 : add link to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.539246 0.539246 query
type query> delete 23 20
4720 vertices 27443 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.004432 0.004432 query
type query> add 23 20
4720 vertices 27444 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
1 0.0046639 0.0046639 query
The complete application is at https://github.com/JamesBremner/graphConnectivity
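For readers who want something to compare against, the add / delete / reach queries shown above can be mimicked with a very small baseline. This is NOT the code in the linked repository; it is a hypothetical sketch using adjacency sets and a breadth-first search, and the class name is made up. Insertions and deletions are cheap, but every reach query may scan the whole graph, so it is the kind of trivial implementation the question excludes. Its main use would be as a correctness oracle when testing a faster structure.

```python
from collections import deque

class NaiveDynamicGraph:
    """Baseline only: O(1) expected add/delete, O(V + E) per reach query."""

    def __init__(self):
        self.adj = {}  # vertex -> set of neighbouring vertices

    def add(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

    def delete(self, u, v):
        # discard() is a no-op if the link is not present.
        self.adj.get(u, set()).discard(v)
        self.adj.get(v, set()).discard(u)

    def reach(self, src, dst):
        """Breadth-first search from src; True if dst is reachable."""
        if src == dst:
            return True
        seen = {src}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nb in self.adj.get(node, ()):
                if nb == dst:
                    return True
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        return False

g = NaiveDynamicGraph()
g.add(23, 20)
g.add(20, 7)
before = g.reach(23, 7)
g.delete(23, 20)
after = g.reach(23, 7)
print(before, after)
```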
To demonstrate how this application can be used, I built it with the graph engine at https://github.com/JamesBremner/PathFinderFeb2023 and ran it on a couple of the test datasets from https://dyngraphlab.github.io/
dataset            | edge count | delete | add
3elt.graph.seq.txt | 27,443     | 5 ms   | 5 ms
144.graph.seq.txt  | 2,148,787  | 13 ms  | 13 ms
To get the average time to perform multiple queries, use the random command, like this:
Supported queries
add v1 v2 : add link to graph
add random n : add n random links to graph
delete v1 v2 : remove link from graph
reach src dst : find path between vertices
read filepath : input graph links from file
help : this help display
type query> read ../dat/3elt.graph.seq.txt
4720 vertices 27444 edges
type query> add random 10
4720 vertices 27454 edges
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
10 1.62e-06 1.62e-05 randomAdd

Can RDF/SPARQL be used for sub-graph matching?

I would like to build a knowledge graph of a set of instances, where each instance is itself a collection of ordered sub-instances. As a simple example, let's assume my instances are chains of marbles {CHAIN1, CHAIN2, CHAIN3, ...} and the sub-instances are colored marbles {CHAIN1: YELLOW-RED-BLUE-RED; CHAIN2: BLUE-YELLOW-GREEN; CHAIN3: GREEN-RED-BLUE-RED}.
Just to clarify, an incorrect approach would define CHAIN1 something like this:
:CHAIN1 :has_marble :YELLOW, :RED, :BLUE, :RED
but querying this would clearly only yield a "bag of marbles" situation.
I would like to be able to:
Query the knowledge graph such that I can get back the marbles for each chain in the correct order.
Match sequences of marbles between different chains. For example, I might want to get all the chains that have the sequence :RED-:BLUE-:RED as a sub-sequence (i.e., CHAIN1 and CHAIN3).
Questions:
What would be the best way of building this knowledge graph? Should I store the marbles as RDF sequences using rdf:first/rdf:rest? Or is there a better, more flexible option? If possible, I would like to be able to define the type of relation between the marbles, say :RED :is_followed_by :BLUE.
Is the type of graph matching I'm after possible? And how about if I'd like to match the sequences using some properties that describe each marble? Say, :BLUE :has_shape :SQUARE, and match the sequence of marbles by their shape?
Note: What I really want to model are chains of DNA and protein sequences, so if anyone has specific recommendations for such applications, that would be even more helpful.
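To make the modelling question concrete, here is a small sketch in plain Python rather than RDF/SPARQL. It is only an illustration under assumptions: each marble occurrence gets its own node (names like CHAIN1_m0 are invented), colour is attached to that node via a has_color property, and order is carried by is_followed_by links between occurrence nodes, not between colours. Triples are stored as plain tuples.

```python
def chain_to_triples(chain, colors):
    """One node per marble occurrence, so repeated colours stay distinct."""
    triples = []
    for i, color in enumerate(colors):
        node = f"{chain}_m{i}"
        triples.append((chain, "has_marble", node))
        triples.append((node, "has_color", color))
        if i > 0:
            triples.append((f"{chain}_m{i - 1}", "is_followed_by", node))
    return triples

def ordered_colors(triples, chain):
    """Recover the marbles of one chain in order by walking is_followed_by."""
    color = {s: o for s, p, o in triples if p == "has_color"}
    nxt = {s: o for s, p, o in triples if p == "is_followed_by"}
    members = {o for s, p, o in triples if p == "has_marble" and s == chain}
    head = (members - {nxt[m] for m in members if m in nxt}).pop()
    out, node = [], head
    while node is not None:
        out.append(color[node])
        node = nxt.get(node)
    return out

def chains_with_subsequence(triples, pattern, prop="has_color"):
    """Chains containing `pattern` as a contiguous run of `prop` values."""
    value = {s: o for s, p, o in triples if p == prop}
    nxt = {s: o for s, p, o in triples if p == "is_followed_by"}
    owner = {o: s for s, p, o in triples if p == "has_marble"}
    hits = set()
    for start in value:
        node, matched = start, True
        for wanted in pattern:
            if node is None or value.get(node) != wanted:
                matched = False
                break
            node = nxt.get(node)
        if matched:
            hits.add(owner[start])
    return sorted(hits)

triples = (chain_to_triples("CHAIN1", ["YELLOW", "RED", "BLUE", "RED"])
           + chain_to_triples("CHAIN2", ["BLUE", "YELLOW", "GREEN"])
           + chain_to_triples("CHAIN3", ["GREEN", "RED", "BLUE", "RED"]))

order = ordered_colors(triples, "CHAIN1")
matches = chains_with_subsequence(triples, ["RED", "BLUE", "RED"])
print(order, matches)
```

With the data stored this way, a fixed-length pattern such as RED-BLUE-RED should correspond to a chain of ordinary triple patterns in SPARQL (three occurrence variables joined by is_followed_by, each constrained by has_color). Matching by shape instead of colour would only change which property is looked up, which is what the prop argument stands in for here.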

Machine Learning text comparison model

I am creating a machine learning model that essentially scores how closely one text matches another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how closely text 1 matches text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two sentences above should score as an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less weight, which in effect gives more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation-oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (as many dimensions as there are unique words). You could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
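The points above can be sketched in plain Python. This is a minimal illustration, not a production implementation: tokenization is a bare whitespace split, and the smoothed IDF formula (log((1 + N) / (1 + df)) + 1) is one common variant that I picked so that words appearing in every document still get a small non-zero weight. The third sentence is made up to give a contrast.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant): frequent terms get lower weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["the cat and a dog", "a dog and the cat", "the bird and a fish"]
vecs = tfidf_vectors(docs)

same = cosine(vecs[0], vecs[1])   # same words, different order
diff = cosine(vecs[0], vecs[2])   # shares only 'the', 'and', 'a'
print(same, diff)
```

Because the representation is a bag of words, the first two sentences come out as essentially identical, while the third only overlaps on the low-weight function words. If you want to tell the model which words are "significant", one option would be to multiply the weights of those terms by a hand-chosen factor before computing the cosine.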
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Range of possible values for alpha, gamma and eta params of HLDA's Mallet implementation

I'm trying to run the hLDA algorithm to produce a descriptive hierarchy of the input documents. The problem is that I'm running various parameter configurations and trying to understand how it works in an "empirical way", because I cannot match the parameters to the ones used in the original papers (I understand it's a different team). E.g. alpha in Mallet seems to be eta in the paper, but I'm not sure. Besides, I don't know the boundaries for each of them, I mean the range of possible values for each parameter.
In the source code, there is some help:
double alpha; // smoothing on topic distributions
double gamma; // "imaginary" customers at the next
double eta; // smoothing on word distributions.
First, I used the default values: alpha=10.0; gamma=1.0; eta = 0.1;
Then, I tried running the algorithm with different values and interpreting the results, but I can't understand what they mean. E.g. I think changing gamma (in Mallet) affects the customer's decision: whether to start a new node in the tree or to be placed in an existing one. So, if I set gamma = 0.5, fewer nodes should be produced, because 0.5 is half the probability of the default one, right? But the results with gamma=1 give me 87 nodes, and with gamma=0.5, it returns 98! That raises a new question for me: is it a probability at all? I tried to find the range of possible values in these two papers, but I didn't find it:
Hierarchical Topic Models and the Nested Chinese Restaurant Process
The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies
I know I could be missing something, because I don't have a good background in this, but that's why I'm asking here; maybe someone has already had this problem and can help me understand those limits.
Thanks in advance!
It may be helpful to run multiple times with each hyperparameter setting. I suspect that gamma does not have a big influence on the final number of topics, and that what you are seeing could just be typical variability in the sampling process.
In my experience the parameter that has by far the strongest influence on the number of topics is actually eta, the topic-word smoothing.
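On whether gamma is a probability: in the standard Chinese restaurant process that the nested CRP in those papers builds on, gamma acts as a pseudo-count rather than a probability. A new customer joins an existing table with probability proportional to the number of customers already there, and opens a new table with probability proportional to gamma. I am assuming Mallet's gamma plays this role, which is consistent with the "imaginary customers" source comment, but I have not verified it line by line. A small sketch of the prior (the function name is mine):

```python
def crp_seating_probabilities(table_counts, gamma):
    """Prior seating probabilities for the next customer in a CRP.

    table_counts : customers already seated at each existing table
    gamma        : concentration parameter (any positive real, not a probability)
    """
    denom = sum(table_counts) + gamma
    existing = [count / denom for count in table_counts]
    new_table = gamma / denom
    return existing, new_table

# Ten customers already seated at three tables (made-up numbers).
existing, new_default = crp_seating_probabilities([5, 3, 2], gamma=1.0)
_, new_half = crp_seating_probabilities([5, 3, 2], gamma=0.5)
print(existing, new_default, new_half)
```

So under this reading gamma can be any positive real, and halving it lowers the prior chance of opening a new node without exactly halving it. That is only the prior, though; the data likelihood and sampling noise also affect the final tree, which fits with the run-to-run variability mentioned above.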

Pairwise cosine similarity

I'm a little confused after reading this paper: Pairwise Document Similarity in Large Collections with MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
In this paper, the authors don't seem to consider words that appear in only one document, but according to the definition of cosine similarity we need to take them into account, right?
The data I used is this: https://www.dropbox.com/s/nctb66hh84ab32c/postings-Reuters-data
The Java code I used is this: https://www.dropbox.com/s/aklviixup4uulmu/CosineSimilarity.java
And the results I generated are here: https://www.dropbox.com/s/ea6ov7l7yut7yfj/part-00000
In the results, I see a lot of 1's and even numbers bigger than 1. That seems odd to me; could someone help me find out the reason? Thanks.
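One way to reason about this, sketched in Python rather than Java and without having inspected the linked code: a term that occurs in only one document contributes nothing to any dot product, so it can be skipped when generating document pairs from the postings. It does still contribute to that document's vector length, though. If the lengths are computed from shared terms only, or the weights are not normalized at all, scores of exactly 1 or above 1 are what you would expect to see. The postings below are invented for illustration.

```python
import math
from collections import defaultdict

def pairwise_cosine(postings):
    """postings: term -> {doc_id: weight}. Returns {(doc_a, doc_b): cosine}."""
    norm_sq = defaultdict(float)  # squared vector length per document
    dot = defaultdict(float)      # accumulated dot product per document pair
    for term, docs in postings.items():
        ids = sorted(docs)
        # Every term counts toward the norm, even if only one doc has it.
        for d in ids:
            norm_sq[d] += docs[d] ** 2
        # Only terms shared by at least two docs add to a dot product.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                dot[(a, b)] += docs[a] * docs[b]
    return {pair: s / math.sqrt(norm_sq[pair[0]] * norm_sq[pair[1]])
            for pair, s in dot.items()}

postings = {
    "oil":   {"d1": 2.0, "d2": 1.0},
    "price": {"d1": 1.0, "d2": 1.0},
    "wheat": {"d2": 3.0},          # appears in d2 only
}
sims = pairwise_cosine(postings)
print(sims)
```

Here "wheat" never enters the dot product but does enlarge the length of d2, which pulls the similarity down. Dropping it from the norm would inflate the score, and with properly normalized vectors no pair can exceed 1 (up to floating-point rounding).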