What is Dimensionality Reduction? Feature Selection or Extraction - data-science

As far as I know, dimensionality reduction (DR) is a technique that transforms high-dimensional data into a lower-dimensional representation. But is it feature selection or feature extraction? Are the features only SELECTED from the available features, or are they engineered?
(This was asked in a test, where I had to choose between feature selection and feature extraction.)

The tag wiki for data-reduction states:
"In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction."
So:
"But is it feature selection or feature extraction?"
It is either one or the other.
"Are the features only SELECTED from the available features, or are they engineered?"
Again, I think the answer is either one or the other. (I don't know what you mean by "engineered" in this context.)
If this does not help you understand, I suggest that you:
- Ask a more detailed / specific question
- Read the Wikipedia articles on Dimensionality Reduction, Feature Selection, Feature Extraction, and so on.

Related

Box-Cox transformation with tree-based models (XGBoost to be specific)

I have a question regarding the Box-Cox transformation (or log transformation). I am working on a dataset that has lots of skewed features. When I apply the Box-Cox transformation, I get quite a nice distribution, but the correlation with the target decreases. If I were working with linear models, I would just use the correlation to decide whether to transform a feature or not. But, as I mentioned, I am working with tree-based models, so should I transform the feature to get a more dispersed distribution, or should I leave the feature as it is to avoid a decrease in correlation?
I have added a screenshot of the distribution and its relationship with the target variable, for both the transformed and the untransformed feature (the left two plots show the original feature and the target).
PS: Judging from the plots, it seems to me that if I transform the feature, it will be easier for a tree to find a split on this particular feature.
Thanks a lot,
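The effect described in the question can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the feature is a made-up log-normal variable, the target is a made-up linear function of it plus noise, and lambda is fixed at 0 (the log case) instead of being estimated. It compares the Pearson correlation with the rank (Spearman) correlation before and after the transformation. Because Box-Cox with a fixed lambda is strictly increasing on positive values, the ordering of the feature values does not change, and that ordering is, in principle, what a threshold-based split works with.

```python
import math
import random


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic data (an assumption): skewed feature, target linear in the raw feature.
rng = random.Random(0)
feature = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(2000)]
target = [x + rng.gauss(0.0, 0.5) for x in feature]
transformed = box_cox(feature, lam=0)

pearson_raw = pearson(feature, target)
pearson_bc = pearson(transformed, target)
spearman_raw = spearman(feature, target)
spearman_bc = spearman(transformed, target)

print(f"Pearson  raw={pearson_raw:.3f}  transformed={pearson_bc:.3f}")
print(f"Spearman raw={spearman_raw:.3f}  transformed={spearman_bc:.3f}")
```

On this synthetic data the Pearson correlation drops after the transformation while the rank correlation stays the same. Whether that carries over to a real dataset and to a particular XGBoost configuration would need to be checked, for example by cross-validating both variants.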

Feature selection and estimating document similarity in text mining

I'm working on a text mining project using the WEKA library in Java. In the preprocessing step I applied the StringToWordVector filter. In this filter, I set several options, such as tokenizing, stop-word removal, stemming, and the TF-IDF weighting scheme.
I have some questions:
1- Is it necessary to do a feature selection step in every text mining project?
2- Is it necessary to estimate the similarity of documents, for example by using cosine similarity?
Or are these two steps optional?
And does the StringToWordVector filter do either of these?
1- It is not necessary. Nobody imposes that step on you. But results usually improve with appropriate feature selection methods.
2- It is necessary only if that is a goal of your project; it is not imposed by any means. The StringToWordVector filter only converts your strings into word vectors for further processing or analysis. It is up to you what you calculate from your data. If you need a similarity measure, then cosine similarity is a suitable one.
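To make the similarity step concrete, here is a minimal sketch in Python (not WEKA) of cosine similarity between TF-IDF word vectors. The weighting used here, term count times log(N / document frequency), is one common variant and an assumption; it is not claimed to match what StringToWordVector computes. The function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn whitespace-tokenized documents into sparse TF-IDF dictionaries."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share some terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
print(sim_01, sim_02)
```

Documents that share no terms get a similarity of 0, and a document compared with itself gets 1.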

How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem and want to run some experiments. I want to perform clustering on citation records. My dataset consists of 2000 XML records. I need testing data, but the dataset I'm using is not a popular one, so I have to create the testing data manually, and I don't know how to do so. I need instructions on how to create testing data manually. Note: I want to compare the performance of a set of techniques for solving the author name disambiguation problem, so I must perform testing.
It is not really clear what kind of testing you want to perform, but the general answer to the issue at hand (artificially creating more data from the data you have) is the bootstrap. It is a technique in which you sample with replacement from your dataset as many times as you want: it repeatedly picks a random element from your data until you have a sample of the size you want. The sample can be larger than your original dataset but should have statistical properties similar to those of the original. Bootstrap sampling is available in sklearn.
P.S. Keep in mind that this solution is not optimal; the best solution to this problem is to actually get more real data somehow.
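Sampling with replacement can be sketched with the Python standard library alone. The record names below are placeholders, and the helper name `bootstrap_sample` is made up for this illustration. In scikit-learn, `sklearn.utils.resample` with `replace=True` serves the same purpose.

```python
import random


def bootstrap_sample(records, size, seed=None):
    """Draw `size` records uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(records, k=size)


# Placeholder records standing in for the parsed XML citation records.
records = [f"record-{i}" for i in range(10)]

# The sample may be larger than the original dataset; duplicates are expected.
sample = bootstrap_sample(records, size=25, seed=42)
print(sample)
```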
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have a feature vector for each author / publication. Now you give the classifier two of those feature vectors. It classifies them as "it is the same author" or "those are different authors".
Training / testing data
With a binary classification problem, testing suddenly becomes simple: just use one of the measures commonly used in the literature (accuracy, precision, recall, confusion matrix).
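As a small illustration, the measures named above can be computed from the four confusion matrix counts. The labels and predictions below are made up; 1 stands for "same author" and 0 for "different authors".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up ground truth and classifier output for eight record pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("confusion matrix (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```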
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and the authors have an identifier? Then you can simply generate negative examples from pairs of records with different author identifiers and positive examples from pairs of records with the same identifier.
Otherwise you can have a look at http://dblp.uni-trier.de/. Although it likely still lists some publications under one author that actually belong to different people, it distinguishes authors not only by name but also gives them identifiers.
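Assuming each record does carry an author identifier, generating labelled pairs could look like the sketch below. The `(record_id, author_id)` tuples are a hypothetical stand-in for whatever is parsed from the XML file.

```python
from itertools import combinations


def labelled_pairs(records):
    """Build (record_a, record_b, same_author) triples from all record pairs.

    `records` is a list of (record_id, author_id) tuples.
    """
    pairs = []
    for (id_a, author_a), (id_b, author_b) in combinations(records, 2):
        pairs.append((id_a, id_b, author_a == author_b))
    return pairs


# Hypothetical records: r1 and r2 were written by the same author.
records = [("r1", "a1"), ("r2", "a1"), ("r3", "a2"), ("r4", "a3")]
pairs = labelled_pairs(records)

positives = [p for p in pairs if p[2]]
negatives = [p for p in pairs if not p[2]]
print(len(positives), "positive and", len(negatives), "negative examples")
```

With 2000 records this yields roughly two million pairs, most of them negative, so sampling the negatives down may be worth considering.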
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
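As an illustration of injecting extra features, the sketch below builds one bag of prefixed features from the body words, word bigrams, title words, and URL tokens of a page. It is plain Python, not Mahout code, and the prefixes, the function name `page_features`, and the example page are all made up. The resulting counts would still have to be converted into whatever vector format the classifier expects.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_features(url, title, body):
    """Combine several information sources into one sparse feature count."""
    features = Counter()
    body_tokens = tokens(body)
    features.update(f"word={t}" for t in body_tokens)
    features.update(
        f"bigram={a}_{b}" for a, b in zip(body_tokens, body_tokens[1:])
    )
    # Prefixes keep title and URL tokens separate from body words.
    features.update(f"title={t}" for t in tokens(title))
    parsed = urlparse(url)
    features.update(f"url={t}" for t in tokens(f"{parsed.netloc} {parsed.path}"))
    return features


features = page_features(
    "http://cnn.com/world/news",
    "Breaking News",
    "breaking news from around the world",
)
print(features)
```

Prefixing is one simple design choice: it lets the classifier weight "news" in a URL differently from "news" in the body text.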

Looking for ideas/references/keywords: adaptive-parameter-control of a search algorithm (online-learning)

I'm looking for ideas/experiences/references/keywords regarding an adaptive-parameter-control of search algorithm parameters (online-learning) in combinatorial-optimization.
A bit more detail:
I have a framework which is responsible for optimizing a hard combinatorial-optimization-problem. This is done with the help of some "small heuristics" which are used in an iterative manner (large-neighborhood-search; ruin-and-recreate-approach). Each of these "small heuristics" takes some external parameters, which control the heuristic-logic to some extent (at the moment: just random values; some kind of noise; diversify the search).
Now I want to have a control-framework for choosing these parameters in a convergence-improving way, as general as possible, so that new heuristics can be added later without changing the parameter-control.
There are at least two general decisions to make:
A: Choose the algorithm-pair (one destroy- and one rebuild-algorithm) which is used in the next iteration.
B: Choose the random parameters of the algorithms.
The only feedback is an evaluation-function of the new-found-solution. That leads me to the topic of reinforcement-learning. Is that the right direction?
Not really a learning-like-behavior, but the simplistic ideas at the moment are:
A: A roulette-wheel-selection according to some performance-value collected during the iterations (near past is more valued than older ones).
So if heuristic 1 did find all the new global best solutions -> high probability of choosing this one.
B: No idea yet. Maybe it's possible to use non-uniform random values in the range (0,1) and keep track of some momentum of the changes.
So if heuristic 1 last time used alpha = 0.3 and found no new best solution, then used 0.6 and found a new best solution -> there is a momentum towards 1
-> next random value is likely to be bigger than 0.3. Possible problems: oscillation!
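Idea A can be sketched as follows. This is only one possible reading of it: each heuristic (or destroy/rebuild pair) keeps a score that is updated as an exponential moving average of its rewards, so recent iterations count more than older ones, and the next heuristic is drawn with probability proportional to its score. The class name, the `decay` and `floor` parameters, and the reward values are assumptions made for this illustration.

```python
import random


class RouletteSelector:
    """Roulette-wheel selection over heuristics with recency-weighted scores."""

    def __init__(self, names, decay=0.8, floor=0.05, seed=None):
        self.scores = {name: 1.0 for name in names}
        self.decay = decay  # weight of the old score in the moving average
        self.floor = floor  # minimum weight, so no heuristic dies out completely
        self.rng = random.Random(seed)

    def choose(self):
        """Pick a heuristic with probability proportional to its score."""
        names = list(self.scores)
        weights = [max(self.scores[name], self.floor) for name in names]
        return self.rng.choices(names, weights=weights, k=1)[0]

    def update(self, name, reward):
        """Blend the new reward into the score; older rewards fade out."""
        old = self.scores[name]
        self.scores[name] = self.decay * old + (1.0 - self.decay) * reward


selector = RouletteSelector(["h1", "h2"], seed=0)

# Pretend h1 keeps finding new best solutions (reward 1) and h2 never does.
for _ in range(20):
    selector.update("h1", 1.0)
    selector.update("h2", 0.0)

counts = {"h1": 0, "h2": 0}
for _ in range(1000):
    counts[selector.choose()] += 1
print(selector.scores, counts)
```

The floor keeps a small selection probability for every heuristic, which matters if a heuristic that is poor at the beginning becomes useful later in the search.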
Things to remark:
- The parameters needed for good convergence of one specific algorithm can change dramatically -> maybe more diversify-operations needed at the beginning, more intensify-operations needed at the end.
- There is a possibility of good synergistic-effects in a specific pair of destroy-/rebuild-algorithm (sometimes called: coupled neighborhoods). How would one recognize something like that? Is that still in the reinforcement-learning-area?
- The different algorithms are controlled by a different number of parameters (some taking 1, some taking 3).
Any ideas, experiences, references (papers), keywords (ml-topics)?
If there are ideas regarding decision (B) in an offline-learning manner, don't hesitate to mention them.
Thanks for all your input.
Sascha
You have a set of parameter variables which you use to control your set of algorithms. Selection of your algorithms is just another variable.
One approach you might like to consider is to evolve your 'parameter space' using a genetic algorithm. In short, GA uses an analogue of the processes of natural selection to successively breed ever better solutions.
You will need to develop an encoding scheme to represent your parameter space as a string, and then create a large population of candidate solutions as your starting generation. The genetic algorithm itself takes the fittest solutions in your set and then applies various genetic operators to them (mutation, reproduction etc.) to breed a better set which then become the next generation.
The most difficult part of this process is developing an appropriate fitness function: something to quantitatively measure the quality of a given parameter space. Your search problem may be too complex to measure for each candidate in the population, so you will need a proxy model function which might be as hard to develop as the ideal solution itself.
Without understanding more of what you've written it's hard to see whether this approach is viable or not. GA is usually well suited to multi-variable optimisation problems like this, but it's not a silver bullet. For a reference start with Wikipedia.
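A minimal sketch of such a genetic algorithm is shown below. It deviates from the description above in one respect: candidates are encoded as lists of real numbers in [0, 1] instead of strings. The fitness function is a toy stand-in (closeness to a made-up target vector); in practice it would be the expensive part discussed above, or a proxy for it. All names and default values are assumptions.

```python
import random


def evolve(fitness, n_params, population_size=30, generations=40,
           mutation_scale=0.1, seed=None):
    """Evolve parameter vectors in [0, 1]^n_params towards higher fitness."""
    rng = random.Random(seed)
    population = [
        [rng.random() for _ in range(n_params)] for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: keep the fitter half as parents (they survive unchanged).
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: small Gaussian noise, clipped to the valid range.
            child = [
                min(1.0, max(0.0, gene + rng.gauss(0.0, mutation_scale)))
                for gene in child
            ]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(params):
    """Toy stand-in: negative squared distance to a made-up optimum."""
    target = (0.3, 0.7)
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = evolve(toy_fitness, n_params=2, seed=1)
print(best, toy_fitness(best))
```

To include the choice of algorithm as "just another variable", one gene could be mapped to an index into the list of available heuristics.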
What you're trying to do sounds like hyper-heuristics. Try searching for that keyword.
In Drools Planner (open source, Java) I have support for tabu search and simulated annealing out of the box.
I haven't implemented the ruin-and-recreate-approach (yet), but that should be easy, although I am not expecting better results. Challenge: Prove me wrong and fork it and add it and beat me in the examples.
Hyper heuristics are on my TODO list.