I am interested in calibrating a binary probabilistic classifier in TFX. I was about to try doing it in standard Python externally to TFX, but then I found this piecewise linear calibration layer.
The description is a bit cryptic to me. Is this layer the sort of thing one could stack to the output layer of a TFX model and calibrate the output using recent y_true and y_pred?
If not, is there a standard way to do calibration in TFX?
Calibration of the data should be done prior to the data transformation and classification.
The piecewise linear calibration is only applicable where the data falls within regions of observed data.
We are not given enough information to properly answer this question.
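For the "standard Python externally to TFX" route mentioned in the question, here is a minimal sketch of a piecewise linear calibration map fit from recent y_true and y_pred. It is not the TensorFlow Lattice layer and not a TFX component; the function names, the equal-width binning, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_piecewise_linear_calibrator(y_pred, y_true, n_bins=10):
    """Fit knots (mean score, observed positive rate) per score bin."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # Equal-width bins over [0, 1]; digitize returns a bin index in 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_pred, edges[1:-1])
    knots_x, knots_y = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # bins without observed scores contribute no knot
            knots_x.append(y_pred[mask].mean())
            knots_y.append(y_true[mask].mean())
    return np.array(knots_x), np.array(knots_y)

def calibrate(scores, knots_x, knots_y):
    # Linear interpolation between knots; outside the observed score range
    # np.interp simply holds the first/last knot value.
    return np.interp(scores, knots_x, knots_y)

# Synthetic, deliberately miscalibrated scores (assumption for the demo):
# the true positive probability is p, but the model reports p**2.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.0, 1.0, size=5000)
y_true = rng.random(5000) < p_true
y_pred = p_true ** 2

knots_x, knots_y = fit_piecewise_linear_calibrator(y_pred, y_true)
calibrated_mid = float(calibrate(0.25, knots_x, knots_y))  # raw 0.25 maps near 0.5
```

The map is only informed by score regions that actually contain observations, which matches the caveat above about regions of observed data.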
I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
I tried that with my dataset and ended up with an accuracy around 0.4 (unfortunately can't share the dataset due to confidentiality)
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that the AutoML uses SparseNormalizer, XGBoostClassifier, with reg:logistic as objective.
So if I interpret this right, AzureML just normalizes the data (somehow from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for calibration control, you get ease-of-use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
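As a rough illustration of what inspecting the steps looks like, here is a plain scikit-learn Pipeline standing in for a fitted AutoML model. The step names, the toy data, and the choice of estimators are assumptions for the sketch; an actual AutoML model will contain whatever steps the service selected.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Toy categorical data (hypothetical).
X = [["red", "S"], ["blue", "M"], ["red", "L"], ["blue", "S"]]
y = [0, 1, 0, 1]

# Pre-processing steps bundled with a predictive estimator.
pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("normalizer", Normalizer()),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X, y)

# named_steps maps each step name to the fitted transformer or estimator.
step_names = list(pipeline.named_steps)
for name, step in pipeline.named_steps.items():
    print(name, type(step).__name__)
```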
I especially empathize with your concern w.r.t. how LightGBM (MSFT's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns and, instead of one-hot encoding them, partitions their categories into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical approach is never utilized in AutoML.
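To see why normalizing one-hot encoded categoricals should not hurt a tree model, here is a small pure-Python sketch. It assumes the normalizer rescales each row to unit L2 norm (my reading of how SparseNormalizer is described, not something verified against AutoML internals); the helper names and toy data are made up.

```python
import math

def one_hot(rows, categories_per_column):
    """One-hot encode each categorical column and concatenate the results."""
    encoded = []
    for row in rows:
        vec = []
        for value, cats in zip(row, categories_per_column):
            vec.extend(1.0 if value == c else 0.0 for c in cats)
        encoded.append(vec)
    return encoded

def l2_normalize(rows):
    """Rescale every row to unit L2 norm (all-zero rows are left as zeros)."""
    out = []
    for vec in rows:
        norm = math.sqrt(sum(v * v for v in vec))
        out.append([v / norm if norm else 0.0 for v in vec])
    return out

rows = [("red", "S"), ("blue", "M"), ("red", "L")]
categories = [("red", "blue"), ("S", "M", "L")]

encoded = one_hot(rows, categories)
normalized = l2_normalize(encoded)

# Every row has exactly one 1 per original column, so every row is divided
# by the same constant and each entry becomes either 0 or `scale`.
scale = 1.0 / math.sqrt(len(categories))
```

Since each column still takes only two distinct values in the same rows as before, a threshold split separates exactly the same samples with or without the normalization.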
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.
In Tensorflow, you can perform either classification or linear regression to train your inputs against the labels. Is it possible to perform some classification of your inputs (as pre-processing, not necessarily using Tensorflow) and use the result to decide whether to run the linear regression using Tensorflow?
For example, in an image denoising task, you have found that your linear regression algorithm provides a good smoothing effect on the edges but at the same time removes the details of texture objects. Therefore you would like to perform a binary classification to determine whether an input is a texture object, run the linear regression algorithm using Tensorflow only if it is not, and otherwise do nothing for the texture object.
I understand Tensorflow supports transfer learning, so I guess one possible solution is to perform binary classification using Tensorflow, and transfer the "texture classification" knowledge to instruct Tensorflow to apply the linear regression algorithm only when the input is not a texture object? Please correct me if I am wrong, as I am not too sure whether the above task is doable in Tensorflow (it would be great if you could describe how to do this in detail if it is doable :-) ).
I guess an alternative solution is to use some binary classification without Tensorflow, and filter out (remove) the texture inputs before passing them to Tensorflow.
Please tell me which of the above solutions (or any other solution) is better, if doable, for the above scenario. Any suggestions are welcome.
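For what it's worth, the "classify first, then only denoise the non-texture inputs" idea can be sketched outside of Tensorflow in a few lines. Everything here is a placeholder: the variance threshold stands in for a real texture classifier, and the mean filter stands in for the regression-based denoiser.

```python
import numpy as np

def is_texture(patch, variance_threshold=0.01):
    # Hypothetical stand-in classifier: treat high-variance patches as texture.
    return float(np.var(patch)) > variance_threshold

def smooth(patch):
    # Stand-in for the regression denoiser: replace the patch by its mean.
    return np.full_like(patch, patch.mean())

def denoise(patches):
    # Texture patches pass through untouched; all others are smoothed.
    return [patch if is_texture(patch) else smooth(patch) for patch in patches]

rng = np.random.default_rng(0)
flat_patch = 0.5 + 0.01 * rng.standard_normal((8, 8))  # nearly constant region
texture_patch = rng.uniform(0.0, 1.0, size=(8, 8))     # high-variance region

flat_out, texture_out = denoise([flat_patch, texture_patch])
```

The same gating works if either stand-in is swapped for a trained model, since the classifier only decides which inputs reach the second stage.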
I am using Weisfeiler-Lehman Graph Kernels from here to get the precomputed kernel for the Scikit learn SVM see description.
At test time, what should be the format of my data? I'm really confused about that. See dimension requirements.
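On the dimension question: with scikit-learn's SVC(kernel="precomputed"), fit expects the square kernel matrix between training samples, and predict expects the kernel between each test sample and each training sample, i.e. shape (n_test, n_train). A minimal sketch, where a linear kernel on random vectors stands in for the Weisfeiler-Lehman kernel on graphs (that substitution and all the data are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y_train = np.repeat([0, 1], 10)
# Shift the features by the label so the toy classes are roughly separable.
X_train = rng.standard_normal((20, 5)) + y_train[:, None]
X_test = rng.standard_normal((7, 5))

def kernel(A, B):
    # Stand-in kernel; replace with the graph kernel evaluated between
    # every item in A and every item in B.
    return A @ B.T

K_train = kernel(X_train, X_train)  # shape (n_train, n_train) = (20, 20)
K_test = kernel(X_test, X_train)    # shape (n_test, n_train)  = (7, 20)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
predictions = clf.predict(K_test)   # one prediction per test sample
```

The columns of K_test must follow the same ordering of training samples that was used for K_train.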
Thanks very much.
I was reading an activity recognition paper, https://arxiv.org/pdf/1705.07750.pdf. Here, they use 3D convolution on Inception v1 to perform activity recognition. I was listening to a talk that mentioned visualizing the embedding space of the features from the video.
1) What does it mean to visualize an embedding space? Are you looking at the filters that it has learnt or are you looking for clusterings of similar activities?
2) Do you just visualize the weight matrix for seeing the features that it is capturing? If yes, which weight matrix?
3) Does tf.summary.image() help in visualizing the weight matrix?
The embedding space is the space of the features produced by some learning algorithm. In the specific case of a (convolutional) neural network, this usually means one of the output feature maps (flattened) at some predefined layer or the output of one of the fully connected layers.
What one would visualize is not the weight matrix, but the values of the produced features for some input test data. For example one takes the full test set and passes it through the network and computes the features for each image at a specific layer, and then visualizes those values.
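A minimal sketch of that procedure, using a toy two-layer network with random weights in place of a trained model (the layer sizes, the random inputs, and the PCA projection to 2D are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random weights standing in for a trained network: 64 -> 32 -> 10.
W1 = rng.standard_normal((64, 32))
W2 = rng.standard_normal((32, 10))

def hidden_features(x):
    # Activations of the chosen intermediate layer (ReLU of the first layer).
    return np.maximum(x @ W1, 0.0)

def predict_logits(x):
    # Full forward pass; not needed for the visualization itself.
    return hidden_features(x) @ W2

# Pass the whole (toy) test set through the network up to that layer.
test_inputs = rng.standard_normal((200, 64))
embeddings = hidden_features(test_inputs)  # shape (200, 32)

# Project the features, not the weights, to 2D with PCA via SVD.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T            # shape (200, 2)
```

Each row of points_2d is one test sample; with a trained model, a scatter plot of these rows colored by label is where clusters of similar inputs would show up.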
TensorBoard has functionality to automatically visualize embeddings and other feature spaces; you should take a look at it.
Note that in some application contexts like NLP an embedding has a slightly different definition but the use is the same.
I am building a logistic regression model in tensorflow to approximate a function.
When I randomly select training and testing data from the complete dataset, I get a good result like so (blue are training points; red are testing points, the black line is the predicted curve):
But when I select spatially separate testing data, I get a terrible predicted curve like so:
I understand why this is happening. But shouldn't a machine learning model learn these patterns and predict new values?
A similar thing happens with a periodic function too:
Am I missing something trivial here?
P.S. I did google this query for quite some time but was not able to get a good answer.
Thanks in advance.
What you are trying to do here is not related to logistic regression. Logistic regression is a classifier and you are doing regression.
No, machine learning systems aren't smart enough to learn to extrapolate functions like you have here. When you fit the model you are telling it to find an explanation for the training data. It doesn't care what the model does outside the range of training data. If you want it to be able to extrapolate then you need to give it extra information. You could set it up to assume that the input belonged to a sine wave or a quadratic polynomial and have it find the best fitting one. However, with no assumptions about the form of the function you won't be able to extrapolate.
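Here is a small sketch of the difference, assuming a noiseless quadratic as the target and using piecewise linear interpolation as a stand-in for a flexible model that makes no assumption about the functional form (both choices are mine, for illustration only):

```python
import numpy as np

# Training data only covers x in [0, 1]; the true function is y = x**2.
x_train = np.linspace(0.0, 1.0, 50)
y_train = x_train ** 2
x_new = 3.0  # far outside the training range

# No assumption about the form: interpolate between training points.
# Outside the training range np.interp just returns the boundary value.
flexible_prediction = np.interp(x_new, x_train, y_train)

# Assume the function is a quadratic polynomial and fit its coefficients.
coeffs = np.polyfit(x_train, y_train, deg=2)
quadratic_prediction = np.polyval(coeffs, x_new)

print(flexible_prediction, quadratic_prediction, x_new ** 2)
```

Both models fit the training range equally well; only the one that was told the form of the function has anything sensible to say at x = 3.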