Is the number in RFECV grid_scores_ equivalent to the selected features?

I am seeking some clarity surrounding the number associated with selector.grid_scores_ in RFECV.
I have used the following:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

estimator_RFECV = ExtraTreesClassifier(random_state=0)
estimator_RFECV = RFECV(estimator_RFECV, min_features_to_select=20, step=1, cv=5,
                        scoring='accuracy', verbose=1, n_jobs=-1)
estimator_RFECV = estimator_RFECV.fit(X_train, y_train)
Using estimator_RFECV.ranking_, 27 features are selected through CV; however, when I look at estimator_RFECV.grid_scores_, the value at 27 (accuracy) is not the highest. Am I interpreting grid_scores_ incorrectly, and should I not expect 27 to have the highest accuracy?

Here, estimator_RFECV.ranking_ gives you an array of feature rankings, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1; a feature ranked 2 is less important than one ranked 1, and so on.
So estimator_RFECV.ranking_ gives us the ranking of the features, i.e., their relative importance.
estimator_RFECV.grid_scores_, on the other hand, is an array of cross-validated scores (using the chosen scoring metric), with one entry per number of features evaluated, running from min_features_to_select up to the total number of available features in increments of step. In the case above, if X_train has 27 features it would contain 8 elements, where element i is the accuracy obtained with the top 20 + i features (i.e., with 20 up to 27 features).
And yes, it is entirely possible that a model with fewer features has higher accuracy, because some of the remaining features may simply be irrelevant or noisy.
The RFECV page in the official scikit-learn documentation may also be helpful.
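If it helps, here is a minimal sketch of how grid_scores_ indices map back to feature counts, assuming the fitted estimator_RFECV above and an older scikit-learn version where grid_scores_ still exists and holds one score per evaluated feature count (newer releases expose cv_results_ instead):
import numpy as np

min_features = 20  # same value as min_features_to_select above
# grid_scores_[i] is the CV accuracy obtained when keeping (min_features + i) features
feature_counts = np.arange(min_features, min_features + len(estimator_RFECV.grid_scores_))
best_count_by_cv = feature_counts[np.argmax(estimator_RFECV.grid_scores_)]
print("features selected by RFECV:", estimator_RFECV.n_features_)
print("feature count with the highest CV accuracy:", best_count_by_cv)
In particular, the score for 27 features lives at grid_scores_[27 - min_features] = grid_scores_[7], not at grid_scores_[27].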

How to remove frequent/infrequent features from Sklearn CountVectorizer?

Is it possible to remove a percentage of the features that occur most frequently/infrequently from the CountVectorizer?
So basically, order the features from greatest to least occurrence and just remove a percentage from the left or right side?
As far as I know, there is no straightforward way to do that.
Let me propose a way to achieve the result you want.
I will assume that you are only interested in unigrams (one-word features) to make the examples also simpler.
Regarding the top-x per cent of the features, a possible implementation can be based on the max_features parameter of the CountVectorizer (see user guide).
First, you would need to find out the total number of features by using the CountVectorizer with the default values so that it generates the full vocabulary of terms in the corpus.
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
bow = vect.fit_transform(corpus)
total_features = len(vect.vocabulary_)
Then you use the CountVectorizer with the max_features parameter, limiting the number of features to the top percentage you need, say 20%. When max_features is set, the most frequent terms are selected automatically.
top_vect = CountVectorizer(max_features=int(total_features * 0.2))
top_bow = top_vect.fit_transform(corpus)
Now, regarding the bottom-x per cent of the features, even though I cannot think of a good reason why you would need that, here is an approach. The vocabulary parameter can be used to limit the model to counting only the less frequent terms. For that, we use the output of the first run of the CountVectorizer to create a list of the less common terms.
# Create a list of (term, frequency) tuples sorted by their frequency
sum_words = bow.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1])
# Keep only the terms in a list
vocabulary, _ = zip(*words_freq[:int(total_features * 0.2)])
vocabulary = list(vocabulary)
Finally, we use the vocabulary to limit the model to the less frequent terms.
bottom_vect = CountVectorizer(vocabulary=vocabulary)
bottom_bow = bottom_vect.fit_transform(corpus)
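As a quick sanity check, here is a small self-contained run of both steps on a toy corpus (the 20% fraction is just the illustrative value used above):
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs and birds and fish",
]

vect = CountVectorizer()
bow = vect.fit_transform(corpus)
total_features = len(vect.vocabulary_)

# Top 20% most frequent terms, selected automatically by max_features.
top_vect = CountVectorizer(max_features=int(total_features * 0.2))
top_vect.fit(corpus)

# Bottom 20% least frequent terms, selected via an explicit vocabulary.
sum_words = bow.sum(axis=0)
words_freq = sorted(((w, sum_words[0, i]) for w, i in vect.vocabulary_.items()),
                    key=lambda x: x[1])
bottom_vocab = [w for w, _ in words_freq[:int(total_features * 0.2)]]

print(total_features)                  # full vocabulary size
print(sorted(top_vect.vocabulary_))    # most frequent terms
print(sorted(bottom_vocab))            # least frequent terms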

What is "valency", with regards to machine learning?

This term came up a few times in the Tensorflow Dev Summit, and it shows up in the Tensorflow Extended documentation, but without any sort of definition. After a fair amount of googling, I don't see reference to it in any Statistics-related setting. Searching the Tensorflow repositories produces a few hits, but they're similarly unhelpful. The term does seem to be used in Chemistry, Psychology, and Linguistics, but those definitions appear to be unrelated.
Per the 2017 TFX paper http://stevenwhang.com/tfx_paper.pdf, TFX can calculate a number of stats on a dataset, including:
"The expected valency of the feature in each example, i.e., minimum
and maximum number of values."
We can also look at the code that calculates valency in TFX. From what I can tell, it's designed to run on a feature that is an array, and counts the minimum and maximum number of values within that array for that feature:
# Extract the valency information of the feature.
valency = ''
if feature.HasField('value_count'):
    if (feature.value_count.min == feature.value_count.max and
            feature.value_count.min == 1):
        valency = 'single'
    else:
        min_value_count = ('[%d' % feature.value_count.min
                           if feature.value_count.HasField('min') else '[0')
        max_value_count = ('%d]' % feature.value_count.max
                           if feature.value_count.HasField('max') else 'inf)')
        valency = min_value_count + ',' + max_value_count
from: https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/utils/display_util.py#L68
As discussed in this blog,
Valency indicates the number of values required per training example. In the case of categorical features, single indicates that each training example must have exactly one category for the feature.
More broadly speaking, it applies to features that can hold multiple values per example (in my experience not that common in machine learning), e.g., lists and arrays. In that case, valency refers to the minimum and maximum number of values found in these data types. For lists, one can compute the valency by applying np.min()/np.max() to the list lengths across all available examples of the feature.
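For example, a minimal sketch of that computation with plain numpy (the feature values below are made up):
import numpy as np

# A multi-valued ("list") feature observed in four training examples.
feature_values = [
    ["tag_a"],
    ["tag_a", "tag_b"],
    ["tag_c"],
    ["tag_a", "tag_b", "tag_c"],
]

lengths = np.array([len(v) for v in feature_values])
print(lengths.min(), lengths.max())  # 1 3 -> TFDV would display this valency as "[1,3]"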
After experimenting with both numerical and categorical features, it turns out that a value (e.g., "single") only appears in the "Valency" column when the corresponding "Presence" column is "optional" (tfdv 1.6.0).

Algorithm - finding the order of HMM from observations

I am given data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be some confusion about the word "order":
A first-order HMM is an HMM whose transition probabilities depend only on the previous state; a second-order HMM is one whose transition probabilities depend on the two previous states, and so on. As the order increases, the theory (i.e., the equations) gets considerably heavier, and very few mainstream libraries implement such models.
A search on your favorite browser with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming you assign a single emission distribution to each state (i.e., you do not use HMMs with mixtures of distributions), then indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), or the Minimum Message Length (MML) criterion, all of which are based on the model's likelihood. In practice, using these criteria means training one model per candidate number of states and comparing the resulting likelihood-based scores.
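For illustration, a minimal sketch of selecting the number of states by BIC (assuming the hmmlearn library, Gaussian emissions with diagonal covariances, and a rough free-parameter count; adapt both to your actual model):
import numpy as np
from hmmlearn import hmm

def bic_for_n_states(X, lengths, n_states):
    # X: concatenated observation sequences, lengths: length of each sequence
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=100, random_state=0)
    model.fit(X, lengths)
    log_likelihood = model.score(X, lengths)
    d = X.shape[1]
    # Approximate parameter count: initial probs + transitions + means + diagonal covariances
    n_params = (n_states - 1) + n_states * (n_states - 1) + 2 * n_states * d
    return -2.0 * log_likelihood + n_params * np.log(X.shape[0])

# best_k = min(range(2, 11), key=lambda k: bic_for_n_states(X, lengths, k))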
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, let's say, 90% of the variance of the observations in your training set, then an X-state HMM is a good starting point. The three criteria above are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
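A minimal sketch of that heuristic (scikit-learn's KMeans is assumed; "variance explained" is computed here as 1 minus the within-cluster sum of squares over the total sum of squares):
import numpy as np
from sklearn.cluster import KMeans

def variance_explained(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - km.inertia_ / total_ss

# Smallest number of clusters explaining at least 90% of the variance:
# first_k = next(k for k in range(1, 20) if variance_explained(X, k) >= 0.9)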
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).

Variable size multi-label candidate sampling in tensorflow?

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same number of labels per training example and know it in advance.
When labels have a variable shape [None], and are batched and/or bucketed by bucket size with .padded_batch() + .group_by_window(), it is necessary to provide a variable-size num_true to accommodate all training examples. To my knowledge this is currently unsupported (correct me if I'm wrong).
In other words, suppose we have either a dataset of images with an arbitrary number of labels per image (dog, cat, duck, etc.) or a text dataset with multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive and can vary in number between examples.
But as the number of possible labels can be huge (10k-100k), is there a way to use a sampled loss to improve performance (in comparison with sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
    weights=nce_weights,
    biases=nce_biases,
    labels=labels,
    inputs=inputs,
    num_sampled=num_sampled,
    # Something like this:
    # `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
    # would be preferable here
    num_classes=self.num_classes)
I see two issues:
1) making NCE work with different numbers of true values;
2) classes that are NOT mutually exclusive.
On the first issue, as #michal said, there is an expectation that this functionality will be added in the future. I have tried almost the same thing: using labels with shape=(None, None), i.e., a true_values dimension of None. The sampled_values parameter has the same problem: the true_values count must be a fixed integer. The recommended workaround is to reserve a class (0 is the best choice) to represent <PAD> and pad the label list up to the fixed true_values length. In my case, 0 is a special token that represents <PAD>. Part of the code is here:
# Pad the label list with the <PAD> class (0) so that every example
# has exactly window_size * 2 true labels.
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the labels because of another recommendation in the nce_loss documentation:
Note: By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results.
In my case, the special tokens and the more frequent words have lower indexes, while less frequent words have higher indexes. I included all label classes associated with the input at the same time and padded with zeros up to the true_values length. Of course, you must ignore the 0 class at the end.
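Putting the workaround together, here is a minimal sketch (TF 1.x-style API; nce_weights, nce_biases, inputs, labels_tensor, num_sampled and num_classes are placeholders for your own tensors and values):
import tensorflow as tf

MAX_TRUE = window_size * 2   # fixed num_true after padding with the <PAD> class 0

def pad_labels(labels, max_true=MAX_TRUE, pad_class=0):
    """Pad a variable-length label list with the <PAD> class up to max_true labels."""
    assert len(labels) <= max_true
    return sorted(labels + [pad_class] * (max_true - len(labels)))

# labels_tensor has shape [batch_size, MAX_TRUE] after padding every example.
loss = tf.reduce_mean(
    tf.nn.nce_loss(
        weights=nce_weights,      # [num_classes, embedding_dim]
        biases=nce_biases,        # [num_classes]
        labels=labels_tensor,     # [batch_size, MAX_TRUE]
        inputs=inputs,            # [batch_size, embedding_dim]
        num_sampled=num_sampled,
        num_true=MAX_TRUE,        # now a fixed integer, as the API requires
        num_classes=num_classes))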

What is the output of XGboost using 'rank:pairwise'?

I use the Python implementation of XGBoost. One of the objectives is rank:pairwise and it minimizes the pairwise loss (Documentation). However, it does not say anything about the range of the output. I see numbers between -10 and 10, but can it in principle be anywhere from -inf to inf?
Good question. You may have a look at this explanation from a Kaggle competition:
Actually, in the Learning to Rank field, we are trying to predict the relative score of each document for a specific query. That is, this is not a regression problem or a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to the query when compared to other documents with positive scores.
So rank:pairwise gives a predicted score for ranking.
However, the scores are only valid for ranking within their own group.
So we must set the groups for the input data.
For easy ranking, refer to my project xgboostExtension.
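For reference, a minimal sketch of the group setup with the core xgboost API (synthetic data; the group sizes must sum to the number of rows):
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))        # 12 query-document rows, 5 features
y = rng.integers(0, 3, size=12)     # relevance labels 0..2

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([4, 4, 4])         # three queries with 4 documents each

params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 3}
model = xgb.train(params, dtrain, num_boost_round=20)

scores = model.predict(dtrain)      # unbounded real-valued scores; only their
                                    # relative order within each group matters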
If I understand your questions correctly, you mean the output of the predict function on a model fitted using rank:pairwise.
Predict gives the predicted variable (y_hat).
This is the same for reg:linear / binary:logistic etc. The only difference is that reg:linear builds trees to minimize RMSE(y, y_hat), while rank:pairwise builds trees to maximize MAP(Rank(y), Rank(y_hat)). However, the output is always y_hat.
Depending on the values of your dependent variable, the output can be anything, but I typically expect the output to have much smaller variance than the dependent variable. This is usually the case because it is not necessary to fit extreme data values; the trees just need to produce predictions that are large or small enough to be ranked first or last within their group.