feature selection and estimate the documents similarity in text mining - text-mining

I'm working on a text mining project by WEKA library in Java. In the preprocessing step I applied StringToWordVector filter. In this filter, I set several options like tokenizing, stop words removing, stemming, and TF-IDF weighting scheme.
I have some questions:
1- is it necessary to do a feature selection process in every text mining projects?
2- is it necessary to estimate the similarity of documents, for example: by using Cosine similarity?
or these two options are optional?
and is StringToWordVector filter does some of these?

It is not necessary. Nobody imposes you that step. But results usually improve with appropriate feature selection methods.
It is necesary if that is a goal of your project; it is not imposed by any means. The StringToWordVector filter only does that, convert your strings into wordVectors for further processing or analysis. It is up to you what you calculate from your data. If you need a similarity measure, then, cosine distance is a suitable measure.

Related

What is Dimensionality Reduction ? Feature Selection or extraction

In my knowledge, DR is a technique that transforms high dimensional data into lower dimension. But is it feature selection or feature extraction? Do the features are only SELECTED from the available features or are they engineered?
(Was asked in some test - had to choose from feature selection and extraction)
The tag wiki for data-reduction states:
"In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction."
So:
But is it feature selection or feature extraction?
It is either one or the other.
Do the features are only SELECTED from the available features or are they engineered?
Again, I think the answer is either one or the other. (I don't know what you mean by "engineered" in this context.)
If this is not helping you understand, I suggest:
Ask a more detailed / specific question
Read the Wikipedia articles on:
Dimensionality Reduction
Feature Selection
Feature Extraction
and so on.

Feed large text to PyTextRank

I would like to use PyTextRank for keyphrase extraction. How can I feed feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see on the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
ic(phrase.rank, phrase.count, phrase.text)
ic(phrase.chunks)
Is my option only to concatenate several million documents to a single string and pass it to nlp(text)? I do not think I could use nlp.pipe(texts) as I want to create one network by computing words/phrases from all documents.
No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytestrank to extract entities.
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
One question would be how you are associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?
However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me #pacoid on Tw for details.

Boxcox transformation with tree-based models(XGBoost to be specific)

I have a question regarding boxcox transformation(or log transformation). I am working on a data-set which I have lots of skewed features. Now when I take the boxcox transformation, I get quite a nice distribution but the thing is correlation decrease. Now if I was working with linear models I would just consider correlation to decide I should transform the feature or not. But as I mentioned I am working with tree-based models, so should I transform the feature to get a more dispersed distribution or I leave the feature as it is to avoid a decrease in correlation.
I add a screenshot of distribution and its relationship with the target variable, for both transformed and not transformed(Left 2 plots original feature and target).
PS: Guessing from the plots, it seems to me that if I transform the feature it will be easier for tree to find a split for this particular feature.
Thanks a lot,

How to make testing data manually for clustering of citation records?

I'm doing a research on the author name disambiguation problem. I want to make some experiments. I want to perform clustering on citation records. My dataset consist of 2000 xml records. I need testing data. The dataset that I'm using is not popular and I need to make testing data manually. I don't know how to do so. I need instruction of how to make testing data manually. Note: I want to compare the performance of a set of techniques in solving the author name disambiguation problem, So I must perform testing.
Even though it is not really clear what kind of testing you want to perform, but general answer to the issue at hand - trying to artificially create more data from the data you have at hand - is a bootstrap. In general it is technique when you perform sampling with replacement from your dataset as many times as you want. It randomly picks up some element from your data repetitively untill you get a sample of the size you want. The sample you get could be larger than your original dataset but should have similar (from statistical point of view) as your original dataset. Bootstrap sampling is available in sklearn.
P.S. You need to keep in mind that this solution is not optimal - best solution to this problem is to actually get more real data somehow.
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have a features for each author / publication. Now you give the classifier two of those feature vectors. It classifies "it is the same author" or "those are different authors".
Training / testing data
Having a binary classification problem, the testing suddenly becomes simple: Just use one of the measures used in literature so often (accuracy, precision, recall, confuscation matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and authors have an identifier? Then you can simply generate negative examples by having different authors and positive examples by checking if the identifier is the same.
Otherwise you can have a look at http://dblp.uni-trier.de/. Although there are likely many publications under the same author which should be different, they do distinguish authors not only by name but give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.

How to extract semantic relatedness from a text corpus

The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.
The simplest approach I've read about consists of extracting IF-IDF information from the corpus.
A lot of people use Latent Semantic Analysis to find semantic correlations.
I've come across the Lucene search engine: http://lucene.apache.org/
Do you think it is suitable to extract IF-IDF?
What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?
Thanks in advance!
Mulone
Yes, Lucene gets TF-IDF data. The Carrot^2 algorithm is an example of a semantic extraction program built on Lucene. I mention it since, as a first step, they create a correlation matrix. Of course, you probably can build this matrix yourself easily.
If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
It is very easy if you have lucene index. For example to get correllation you can use simple formula count(term1 and term2)/ count(term1)* count(term2). Where count is hits from you search results. Moreover you can easility calculate other semntica metrics such as chi^2, info gain. All you need is to get formula and convert it to terms of count from Query