Most related Hypernym of a synset in some context - wordnet

I have a word, say w. For w, I used the Lesk algorithm to get the synset s it should belong to, given a context. Now, for this synset s, I want the hypernym out of all its hypernyms that is most relevant in the context of the word w. Is there an algorithm for this in Python?
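One way to sketch this with NLTK (an assumption on my part; there is no single built-in "most relevant hypernym" function): run Lesk to get the synset, then score each of its hypernyms against the other context words with a WordNet similarity measure such as Wu-Palmer. The most_relevant_hypernym helper and the scoring heuristic below are illustrative only, not a standard algorithm.

from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def most_relevant_hypernym(word, context_tokens):
    sense = lesk(context_tokens, word)  # synset s chosen by Lesk
    if sense is None:
        return None
    best, best_score = None, -1.0
    for hyper in sense.hypernyms():
        # Score each hypernym by its best Wu-Palmer similarity to any sense
        # of any other context word (a heuristic, not a standard algorithm).
        score = 0.0
        for ctx in context_tokens:
            if ctx == word:
                continue
            for ctx_sense in wn.synsets(ctx):
                score = max(score, hyper.wup_similarity(ctx_sense) or 0.0)
        if score > best_score:
            best, best_score = hyper, score
    return best

context = "I deposited money at the bank yesterday".split()
print(most_relevant_hypernym("bank", context))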

Related

What impurity index (Gini, entropy?) is used in TensorFlow Random Forests with CART trees?

I was looking for this information in the tensorflow_decision_forests docs (https://github.com/tensorflow/decision-forests, https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/wrappers/CartModel) and in the yggdrasil_decision_forests docs (https://github.com/google/yggdrasil-decision-forests).
I've also taken a look at the code of these two libraries, but I didn't find that information.
I'm also curious if I can specify an impurity index to use.
I'm looking for some analogy to sklearn decision tree, where you can specify the impurity index with criterion parameter.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
For TensorFlow Random Forests I found only the parameter uplift_split_score:
uplift_split_score: For uplift models only. Splitter score i.e. score
optimized by the splitters. The scores are introduced in "Decision trees
for uplift modeling with single and multiple treatments", Rzepakowski et
al. Notation: p probability / average value of the positive outcome,
q probability / average value in the control group.
- KULLBACK_LEIBLER or KL: - p log (p/q)
- EUCLIDEAN_DISTANCE or ED: (p-q)^2
- CHI_SQUARED or CS: (p-q)^2/q
Default: "KULLBACK_LEIBLER".
I'm not sure if it's a good lead.
No, you shouldn't use uplift_split_score, because it is for uplift models only.
Uplift modeling is used to estimate treatment effects and for other tasks in causal inference.
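For reference, this is the scikit-learn interface the question compares against, where the impurity index is selected explicitly via the criterion parameter (a minimal sketch):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# criterion="gini" uses Gini impurity, criterion="entropy" uses information gain
clf_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)
clf_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)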

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform these word count vectors back to the original documents. CountVectorizer does have an inverse_transform(X) function, but it only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy": for a document such as "This is the amazing string in amazing program", it will only store the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.), but it loses the position information.
So by reversing it you can create a document with the same words repeated the same number of times, but their original sequence in the document cannot be retraced.
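A minimal sketch of that lossiness, assuming a recent scikit-learn (with get_feature_names_out): the per-document counts survive, but inverse_transform only returns the unique non-zero tokens, so the original word order is gone.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Counts are recoverable, e.g. {'amazing': 2, 'in': 1, 'is': 1, ...}
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))

# But only the set of non-zero tokens comes back, not their original sequence.
print(vectorizer.inverse_transform(X))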

how to do text clustering from cosine similarity

I am using WEKA for clustering a text collection. Suppose I have n documents with text. I calculated TF-IDF as the feature vector for each document and then calculated the cosine similarity between each pair of documents, which generated an n x n matrix. Now I wonder how to use this n x n matrix in the k-means algorithm. I know I can apply some dimension reduction such as MDS or PCA. What confuses me is this: after applying dimension reduction, how will I identify each document? For example, if I have 3 documents d1, d2, d3, then cosine will give me the distances
d11, d12, d13
d21, d22, d23
d31, d32, d33
Now I am not sure what the output will be after PCA or MDS and how I will identify the documents after k-means. Please suggest. I hope I have put my question clearly.
PCA is used on the raw data, not on distances, i.e. PCA(X).
MDS uses a distance function, i.e. MDS(X, cosine).
You appear to believe you need to run PCA(cosine(X))? That doesn't work.
You want to run MDS(X, cosine).
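A minimal sketch of the MDS(X, cosine) route, using scikit-learn rather than WEKA (an assumption, since the question mentions WEKA). Row order is preserved throughout, so row i of the embedding and of the cluster labels still refers to document i.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

docs = ["first document text", "second document text", "third one is different"]

X = TfidfVectorizer().fit_transform(docs)   # n x vocabulary TF-IDF matrix
D = cosine_distances(X)                     # n x n cosine distance matrix
coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)
for i, label in enumerate(labels):
    print(f"d{i + 1} -> cluster {label}")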

Weighted cosine similarity calculation using Lucene

This question is related to calculating CosineSimilarity between documents using Lucene.
The documents are marked up with Taxonomy and Ontology terms separately. When I calculate the document similarity between documents, I want to give higher weights to those Taxonomy terms and Ontology terms.
When I index the document, I have defined the Document content, Taxonomy and Ontology terms as Fields for each document like this in my program.
Field ontologyTerm= new Field("fiboterms", fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field document = new Field(docNames[curDocNo], strRdElt, Field.TermVector.YES);
I'm using Lucene's TermFreqVector from the index to calculate TFIDF values, and then calculate the cosine similarity between two documents using those TFIDF values.
I can use Lucene's field.setBoost() function to give higher weights to fields before indexing. I used the debugger to see the frequency values of the Taxonomy terms after setting a boost value, but it doesn't change the term frequencies. Does that mean the setBoost() function doesn't have any effect on the TermFreqVector or the TFIDF values? Does setBoost() only increase the weights used in document searching?
Another thing I could do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TFIDF scores. Will this give higher weight to the Taxonomy and Ontology terms in the document similarity calculation?
Are there any other Lucene functions that can be used to give higher weights to certain fields when calculating TFIDF values using the TermFreqVector? Or can I just use the setBoost() function for this purpose, and if so, how?
The TermFreqVector is just that - the term frequencies. No weights. It says in the docs "Each location in the array contains the number of times this term occurs in the document or the document field."
You can see from Lucene's scoring algorithm that boosts are applied as a multiplicative factor. So if you want to replicate that, then yes, this will give your terms a higher weight.
I'm not sure what your use case is, but you might want to consider just using Lucene's Scorer class. Then you won't have to deal with making your own.
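To illustrate the frequency-scaling option independently of the Lucene API (the question's code is Java; this is only a sketch of the arithmetic, with made-up term vectors and an assumed weight of 2.0):

import math

def weighted_cosine(tf_a, tf_b, boosted_terms, weight=2.0):
    # Multiply the counts of boosted terms (e.g. taxonomy/ontology terms) by
    # `weight`, then compute an ordinary cosine similarity over the vectors.
    def scale(tf):
        return {t: c * (weight if t in boosted_terms else 1.0) for t, c in tf.items()}
    a, b = scale(tf_a), scale(tf_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = {"interest": 3, "rate": 2, "swap": 1}
doc2 = {"interest": 1, "rate": 1, "bond": 2}
print(weighted_cosine(doc1, doc2, boosted_terms={"swap", "bond"}))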

Does a larger tf always boost a document's score in Lucene?

I understand that the default term frequency (tf) is simply calculated as the square root of the number of times the searched term appears in a field. So documents containing multiple occurrences of a term you are searching on will have a higher tf and hence a higher weight.
What I'm unsure about is whether this increases the document's score because the weight is higher, or reduces the document's score because it moves the document vector away from the query vector, as the book Hibernate Search in Action seems to be saying (p. 363). I confess I'm really struggling to see how the document vector model fits in with the Lucene scoring equation.
I don't have this book to check, but basically (if we ignore the different boosts that can be set manually at indexing time), there are three reasons why the score of some document may be higher (or lower) than the score of other documents with Lucene's default scoring model and for a given query:
the queried term has a low document frequency (boosting the IDF part of the score),
the queried term has a high number of occurrences in the document (boosting the TF part of the score),
the queried term appears in a rather small field of the document (boosting the norm part of the score).
This means that for two documents D1 and D2 and one queried term T, if
T appears n times in D1,
T appears p > n times in D2,
the queried field of D2 has (almost) the same size (number of terms) as D1,
then D2 will have a better score than D1.
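As a rough numeric illustration of the three factors, here is a sketch using the textbook formulas of Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), norm = 1 / sqrt(fieldLength)), ignoring query normalization, coord and boosts; the numbers are made up:

import math

def classic_score(freq, doc_freq, num_docs, field_length):
    tf = math.sqrt(freq)                               # more occurrences -> higher tf
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))    # rarer term -> higher idf
    norm = 1.0 / math.sqrt(field_length)               # shorter field -> higher norm
    return tf * idf * norm

# D1: the term appears 2 times; D2: 8 times; same field length and collection stats.
print(classic_score(freq=2, doc_freq=10, num_docs=1000, field_length=100))  # lower
print(classic_score(freq=8, doc_freq=10, num_docs=1000, field_length=100))  # higher: D2 wins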