XGBoost node splits on a value that is out of the feature range? - xgboost

I have some features that range from 0 to 1.
But when I dump the model, I find some nodes split those features using "feature < 2.00001".
Does xgboost scale the feature or add some value to it? Or why is 2.00001 chosen as the split point?
Thanks~

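xgboost handles missing values separately from observed ones: at each node it picks a threshold for the observed values and also a default branch for rows where the feature is missing.
A threshold outside the feature's range, such as 2.00001 for a feature in [0, 1], sends every observed value down the same branch. This usually happens when xgboost wants to split only on whether or not the feature is missing, and not on its value.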
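To see why such a node only separates missing from non-missing rows, here is a minimal pure-Python sketch of the routing rule. It does not call xgboost; the function name `route` and the flag `missing_goes_left` are made up for illustration, and the default direction chosen here is an assumption.

```python
import math

def route(value, threshold, missing_goes_left):
    """Return the branch taken by one feature value at a tree node."""
    # Missing values follow the node's default direction.
    if value is None or math.isnan(value):
        return "left" if missing_goes_left else "right"
    # Observed values are compared against the threshold.
    return "left" if value < threshold else "right"

threshold = 2.00001                       # outside the feature range [0, 1]
values = [0.0, 0.37, 1.0, float("nan")]   # last entry is a missing value
branches = [route(v, threshold, missing_goes_left=False) for v in values]
print(branches)  # ['left', 'left', 'left', 'right']
```

Every observed value in [0, 1] is below the threshold, so only the missing value ends up on the other branch.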

Related

Confusion about how bucketized feature columns work

I am confused about how bucketized feature columns represent input to the model. According to the blog post on feature columns, bucketizing a feature like year puts each value into a bucket based on the defined boundaries and creates a binary vector in which only the bucket matching the input value is turned on. The example in the documentation, however, shows the output as a single integer. What does the model actually receive as input when a bucketized column is used? Can anyone clarify this for me please?
From the dimensions of the first hidden layer of the estimator, it seems that for each feature column that is a tf.feature_column.bucketized_column, a one-hot encoded vector is created based on the boundaries.
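One way to reconcile the two views is that the bucket index is the single integer, and the one-hot vector is its dense form. The sketch below illustrates this in plain Python rather than TensorFlow; the boundaries and the helper names `bucket_index` and `one_hot` are assumptions chosen for the example, and each bucket is taken to include its left boundary.

```python
from bisect import bisect_right

def bucket_index(value, boundaries):
    """Index of the bucket containing value; buckets include their left edge."""
    return bisect_right(boundaries, value)

def one_hot(index, size):
    """Binary vector of the given size with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

boundaries = [1960, 1980, 2000]   # hypothetical year boundaries -> 4 buckets
year = 1985
idx = bucket_index(year, boundaries)
vec = one_hot(idx, len(boundaries) + 1)
print(idx, vec)  # 2 [0, 0, 1, 0]
```

With n boundaries there are n + 1 buckets, which would match a first hidden layer that receives n + 1 inputs for that column.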

Using an H2O Flow XGBoost model

It gives a regression prediction as a continuous score that includes negative values, like -1.27544 < x < 6.68112. How do I interpret the negatives?
If you are using an H2O algorithm to predict a binary target (0/1), H2O will assume the target column is numeric and solve a regression problem unless you convert that column to a factor (using .asfactor() in Python or as.factor() in R).
Please check the data type of your target column (it will likely show integer) and make sure that it shows enum after the conversion.
More information about your target distribution choices can be found here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/distribution.html

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex we normally use LabelEncoder() in scikit-learn. But if I'm going to use TensorFlow instead of scikit-learn, what is the equivalent function or methodology for this task? I know that we can do one-hot encoding easily with TensorFlow, but then it will create labels as 10, 01 instead of 1, 0.
There is a package in TensorFlow called tf.feature_column that contains four functions for creating categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed (memory) list of words
categorical_column_with_vocabulary_file(...): Same as _list but reads the vocabulary from file
The package also provides many more ways of getting your input data to the model. For an overview, see this blog post written by the developers of the package.
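Conceptually, the vocabulary-list variant maps each string to its position in the list, which is close to what LabelEncoder produces. The following is a plain-Python sketch of that idea, not TensorFlow code; `make_encoder` is a hypothetical helper, and returning -1 for values outside the vocabulary is an assumption made for this example.

```python
def make_encoder(vocabulary, default_value=-1):
    """Build a function that maps each value to its index in the vocabulary."""
    index = {token: i for i, token in enumerate(vocabulary)}

    def encode(values):
        # Values outside the vocabulary fall back to default_value.
        return [index.get(v, default_value) for v in values]

    return encode

encode_sex = make_encoder(["female", "male"])
ids = encode_sex(["male", "female", "male", "unknown"])
print(ids)  # [1, 0, 1, -1]
```

Each category becomes a single integer (1 or 0), rather than the two-position one-hot form (10, 01) mentioned in the question.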

How to identify relevant features in WEKA?

I would like to perform feature analysis in WEKA. I have a data set of 8 features and 65 instances.
I would like to perform feature selection and optimization functionalities that are available for machine learning methods like SVM.
For example in Weka I would like to know how I can display which of the features contribute best to the classification result.
I think that WEKA provides a nice graphical user interface and allows a very detailed analysis of the influence of single features, but I don't know how to use it. Any help?
You have two options:
You can perform attribute selection using filters. For instance, you can use the AttributeSelection tab (or filter) with the search method Ranker and the attribute evaluation metric InfoGainAttributeEval. This gives you a list of the features ranked by their Information Gain scores, most predictive first. I have done this many times with good results. Sometimes it even helps to increase the accuracy of SVMs, which are known not to need much feature selection. You can also try other search methods, in order to find subgroups of coupled predictors, and other metrics.
You can just look at the coefficients in the SVM output. For instance, in linear SVMs the classifier is a linear function like a1*f1 + a2*f2 + ... + an*fn + f(n+1) > 0, where the ai are the attribute values for an instance and the fi are the "weights" obtained by the SVM training algorithm. Consequently, weights with values close to 0 correspond to attributes that do not count much and are therefore bad predictors; extreme weights (either positive or negative) correspond to good predictors.
Additionally, you can check the visualization options available for a particular classifier (e.g. J48 is a decision tree, and the attribute tested at the root is the best predictor). You can check the AttributeSelection tab visualization options as well.
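To make the information gain ranking concrete, here is a small pure-Python sketch of the score for nominal attributes. It is not WEKA code and only illustrates the idea behind ranking by information gain; the toy attributes and class labels are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(column, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data set: two nominal attributes and a class label per instance.
data = {
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "windy": ["yes", "no", "yes", "no"],
}
labels = ["no", "no", "yes", "yes"]

# Rank attributes from most to least informative.
ranking = sorted(data, key=lambda name: info_gain(data[name], labels),
                 reverse=True)
print(ranking)  # ['outlook', 'windy']
```

Here "outlook" separates the classes perfectly (gain of 1 bit), while "windy" carries no information about the class (gain of 0), so it ranks last.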

Continuous prediction in Google Prediction API?

Is there any announcement about when Google will launch continuous prediction? In the meantime, is there any trick to predict stock prices using Google's Prediction API?
They announced continuous output for v1.1 today, along with the much requested multiple category output:
training data submitted with only numbers to v1.1 in the leftmost column will be treated as a continuous output problem (unlike v1)
...
numerical values in the leftmost column of all rows will
automatically return regression values. if you intend to do classification,
we recommend encasing those values within double quotes. For example, 5
indicates a regression value of 5 while "5" indicates a category labeled
"5."
Yes, it can be written.
The important factor affecting the accuracy of your predictions would be the input parameters that you give.
So, try to vary the input training data between different moving averages or other statistical figures and see what comes close at predicting the action to be taken (Buy/Sell).
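As an illustration of one such input, a simple moving average can be computed as below. This is only a sketch of the feature computation in plain Python; the prices and window size are made-up values, and it says nothing about how well such a feature predicts anything.

```python
def simple_moving_average(prices, window):
    """Mean of each run of `window` consecutive prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

prices = [10.0, 11.0, 12.0, 13.0, 14.0]   # hypothetical closing prices
sma3 = simple_moving_average(prices, 3)
print(sma3)  # [11.0, 12.0, 13.0]
```

Varying the window length gives the different moving averages that can be tried as training inputs.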