Reusing transformations between training and predictions - tensorflow

I'd like to apply stemming to my training data set. I can do this outside of tensorflow as part of training data prep, but I then need to do the same process on prediction request data before calling the (stored) model.
Is there a way of implementing this transformation in tensorflow itself so the transformation is used for both training and predictions?
This problem becomes more annoying if the transformation requires knowledge of the whole dataset, normalisation for example.

Can you easily express your processing (e.g. stemming) as a tensorflow operation? If yes, then you can build your graph in a way that both your training inputs and your predictions make use of the same set of operations. Otherwise, there isn't much harm in calling the same (non-tensorflow) function for both pre-processing and predictions.
Re normalisation: you would compute the dataset statistics (means, variances, etc., depending on how exactly you are normalizing) and then hardcode them into the pre/post-processing, so I don't think that's really an annoying case.
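For the normalisation case, a minimal sketch of that idea (the data, shapes, and model below are placeholders, not from the question): the statistics are computed once during data prep and baked into the graph as constants, so the exact same transformation runs at training and at prediction time.

    import numpy as np
    import tensorflow as tf

    # Placeholder training matrix; in practice compute these statistics once,
    # over your real training dataset, during data prep.
    train_data = np.random.rand(1000, 10).astype(np.float32)
    mean = train_data.mean(axis=0)
    std = (train_data.std(axis=0) + 1e-8).astype(np.float32)

    def normalize(x):
        # Statistics baked in as graph constants: the same transformation is
        # applied when training and when serving predictions.
        return (x - tf.constant(mean)) / tf.constant(std)

    inputs = tf.keras.Input(shape=(10,))
    x = tf.keras.layers.Lambda(normalize)(inputs)
    outputs = tf.keras.layers.Dense(1)(tf.keras.layers.Dense(16, activation="relu")(x))
    model = tf.keras.Model(inputs, outputs)  # the normalisation travels with the saved model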

Using dynamically generated data with keras

I'm training a neural network using keras but I'm not sure how to feed the training data into the model in the way that I want.
My training data set is effectively infinite: I have some code to generate training examples as needed, so I just want to pipe a continuous stream of novel data into the network. keras seems to want me to specify my entire dataset in advance by creating a numpy array with everything in it, but this obviously won't work with my approach.
I've experimented with creating a generator class based on keras.utils.Sequence, which seems like a better fit, but it still requires me to specify a length via the __len__ method, which makes me think it will only create that many examples before recycling them. Can someone suggest a better approach?
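Not part of the question, but a minimal sketch of the generator-style feeding being described, with a dummy data-generating function standing in for the real one; Keras accepts a plain Python generator in model.fit, and steps_per_epoch only controls how many batches are counted as one "epoch", not how many distinct examples exist.

    import numpy as np
    import tensorflow as tf

    def example_stream(batch_size=32):
        # Hypothetical endless generator; replace the body with your own
        # example-generating code.
        while True:
            x = np.random.rand(batch_size, 10).astype(np.float32)
            y = (x.sum(axis=1) > 5.0).astype(np.float32)
            yield x, y

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # The generator never recycles data; steps_per_epoch just defines how often
    # Keras reports an "epoch".
    model.fit(example_stream(), steps_per_epoch=1000, epochs=10)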

Strange algorithm selection when using Azure AutoML with XGBoostClassifier on categorical data

I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
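A rough sketch of that manual workflow, with made-up column names and data for illustration:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from xgboost import XGBClassifier

    df = pd.DataFrame({
        "colour": ["red", "blue", "green", "red", "blue", "green"],
        "size":   ["S",   "M",    "L",     "M",   "S",    "L"],
        "label":  ["cat", "dog",  "bird",  "cat", "dog",  "bird"],
    })

    # Features: one-hot encode the categorical columns into binary columns.
    X = OneHotEncoder().fit_transform(df[["colour", "size"]]).toarray()
    # Label: encode the categorical target into integer classes.
    y = LabelEncoder().fit_transform(df["label"])

    clf = XGBClassifier(objective="multi:softmax")
    clf.fit(X, y)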
I tried that with my dataset and ended up with an accuracy around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that AutoML uses a SparseNormalizer and an XGBoostClassifier with reg:logistic as the objective.
So if I interpret this right, AzureML just normalizes the data (somehow, from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for calibration control, you get ease-of-use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
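For example (a sketch assuming fitted_model comes from an AutoML run's get_output(); automl_run stands for an existing AutoMLRun in your experiment):

    # automl_run: an existing AutoMLRun from your experiment
    best_run, fitted_model = automl_run.get_output()

    print(fitted_model.named_steps)                  # pre-processing steps + estimator
    print(fitted_model.get_featurization_summary())  # how each input column was featurized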
I especially empathize with your concern w.r.t. how LightGBM (MSFT's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns directly and, instead of one-hot encoding them, partitions their categories into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical approach is never utilized in AutoML.
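For contrast, a minimal sketch of LightGBM's native categorical handling outside AutoML (data is made up; the relevant part is the categorical_feature argument):

    import lightgbm as lgb
    import pandas as pd

    df = pd.DataFrame({
        "colour": ["red", "blue", "green", "red", "blue", "green"] * 10,
        "size":   ["S",   "M",    "L",     "M",   "S",    "L"] * 10,
        "y":      [0, 1, 0, 1, 0, 1] * 10,
    })
    for col in ["colour", "size"]:
        df[col] = df[col].astype("category")  # LightGBM reads pandas categoricals directly

    train_set = lgb.Dataset(df[["colour", "size"]], label=df["y"],
                            categorical_feature=["colour", "size"])
    booster = lgb.train({"objective": "binary", "verbosity": -1}, train_set,
                        num_boost_round=10)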
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.

What are the purposes of each step in train-evaluate-predict in tensorflow?

What does each of the stages do? I understand that for neural nets in NLP, training will find the best parameters for the word embeddings. But what is the purpose of the evaluation step? What is it supposed to do? How is that different from the prediction phase?
Training, evaluation and prediction are the three main steps of building a model (in basically any ML framework) and of moving it from research/development to production.
Training:
A suitable ML architecture is selected based on the problem that needs to be solved. Hyperparameter optimization is carried out to fine-tune the model. The model is then trained on the data for a certain number of epochs, while metrics such as loss, accuracy, and MSE are monitored.
Evaluation:
Eventually we need to move the model to production. The model in production will only make inferences, and hence we require the best model possible. So, in order to evaluate or test the model against some predefined criteria, the evaluation phase is carried out.
Evaluation is carried out on a held-out subset of the original dataset; the training and evaluation splits are made while preprocessing the data. Metrics are calculated in order to check the performance of the model on the evaluation dataset.
The evaluation data has never been seen by the model, as the model is not trained on it. Hence, performance on it is the best estimate of how the model will behave on new data.
Prediction:
After the testing of the model, we can move it to production. In the production phase, models only make inferences (predictions) on the data given to them; no training takes place here.
Even after a thorough examination, the model will still make some mispredictions. Hence, in the production stage, we can collect feedback from the users about the performance of the model.
Now,
But what is the purpose of the evaluation step? What is it supposed to do? How is that different from the prediction phase?
Evaluation is there to make the model better on most of the cases it will come across. Predictions, on the other hand, surface other problems that are not related to model performance.
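To make the three steps concrete, here is a minimal Keras sketch (toy data standing in for a real dataset):

    import numpy as np
    import tensorflow as tf

    # Toy data standing in for real training and evaluation splits.
    x_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)
    x_eval, y_eval = np.random.rand(200, 20), np.random.randint(0, 2, 200)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=5)         # training: parameters are learned
    loss, acc = model.evaluate(x_eval, y_eval)    # evaluation: metrics on held-out data
    preds = model.predict(np.random.rand(3, 20))  # prediction: inference only, no labels needed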

Tensorflow: how to restore only specific hidden layers from checkpoint and use them to build a different computational graph for inference?

Let's say I trained a model with a very complex computational graph tailored for training. After a lot of training, the best model was saved to a checkpoint file. Now, I want to use the learned parameters of this best model for inference. However, the computational graph used for training is not exactly the same as the one I intend to use for inference. Concretely, there is a module in the graph with several layers in charge of outputting embedding vectors for items (recommender system context). However, for the sake of computational performance, during inference time I would like to have all the item embedding vectors precomputed in advance, so that the only computation required per request would just involve a couple of hidden layers.
Therefore, what I would like to know how to do is:
1. How to restore only the part of the network that outputs item embedding vectors, in order to precompute these vectors for all items (this would happen in some off-line pre-processing script).
2. Once all item embedding vectors are precomputed, how to restore, during on-line inference, just the hidden layers in the later parts of the network and make them receive the precomputed item embedding vectors instead.
How can the points above be accomplished? I think point 1 is easier to get done, but my biggest concern is with point 2. In the computational graph used for training, in order to evaluate any layer I would have to provide values for the input placeholders. However, during on-line inference these placeholders would be obsolete, because a lot of the work would already be precomputed, and I don't know how to tell the hidden layers in the later parts of the network that they should no longer depend on these obsolete placeholders but on the precomputed vectors instead.
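Not a full answer, but a minimal TF1-style sketch of the two points under stated assumptions: the variable scope names ("item_embedding", "top_layers"), shapes, and checkpoint path are hypothetical and must match whatever was used in the training graph.

    import numpy as np
    import tensorflow.compat.v1 as tf  # TF1-style graph API, as in the question
    tf.disable_v2_behavior()

    EMBED_DIM, NUM_ITEMS = 8, 100  # hypothetical sizes

    # Point 1: rebuild only the embedding sub-graph and restore just its variables.
    with tf.variable_scope("item_embedding"):
        item_ids = tf.placeholder(tf.int32, [None])
        table = tf.get_variable("table", [NUM_ITEMS, EMBED_DIM])
        item_vectors = tf.nn.embedding_lookup(table, item_ids)
    embed_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="item_embedding")
    embed_saver = tf.train.Saver(var_list=embed_vars)  # restores only these variables

    # Point 2: the later layers are built on a new placeholder that receives the
    # precomputed embedding vectors, instead of the original input placeholders.
    with tf.variable_scope("top_layers"):
        precomputed = tf.placeholder(tf.float32, [None, EMBED_DIM])
        scores = tf.layers.dense(precomputed, 1)
    top_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="top_layers")
    top_saver = tf.train.Saver(var_list=top_vars)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # embed_saver.restore(sess, "best_model.ckpt")  # checkpoint path is hypothetical
        vectors = sess.run(item_vectors,
                           feed_dict={item_ids: np.arange(NUM_ITEMS, dtype=np.int32)})
        # top_saver.restore(sess, "best_model.ckpt")
        out = sess.run(scores, feed_dict={precomputed: vectors})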

Tensorflow - How to ignore certain labels

I'm trying to implement a fully convolutional network and train it on the Pascal VOC dataset; however, after reading up on the labels in the set, I see that I need to somehow ignore the "void" label. In Caffe, the softmax function has an argument to ignore labels, so I'm wondering what the mechanism is, so I can implement something similar in tensorflow.
Thanks
In tensorflow you're feeding the data in via feed_dict, right? Generally you'd want to just pre-process the data and remove the unwanted samples - don't give them to tensorflow for processing.
My preferred approach is a producer-consumer model where you fire up a tensorflow queue and load it with samples from a loader thread which simply skips enqueuing your void samples.
When training your model, dequeue samples inside the model (you don't use feed_dict in the optimize step). This way you're not bothering to write out a whole new dataset with the specific preprocessing step you're interested in today (tomorrow you're likely to find you want to do some other preprocessing step).
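A minimal sketch of that producer-consumer pattern under stated assumptions (TF1-style queues; the shapes and the void label id are made up):

    import threading
    import numpy as np
    import tensorflow.compat.v1 as tf  # TF1-style queues, as in the answer
    tf.disable_v2_behavior()

    VOID_LABEL = 255  # hypothetical id for the "void" class

    # Placeholders used only by the loader thread to enqueue examples.
    image_ph = tf.placeholder(tf.float32, shape=[32, 32, 3])
    label_ph = tf.placeholder(tf.int32, shape=[])
    queue = tf.FIFOQueue(capacity=64, dtypes=[tf.float32, tf.int32],
                         shapes=[[32, 32, 3], []])
    enqueue_op = queue.enqueue([image_ph, label_ph])
    image_batch, label_batch = queue.dequeue_many(8)  # the model trains on these tensors

    def loader(sess, examples):
        # Producer thread: skip void-labelled examples instead of enqueuing them.
        for image, label in examples:
            if label == VOID_LABEL:
                continue  # the "skip" step described above
            sess.run(enqueue_op, feed_dict={image_ph: image, label_ph: label})

    # Usage sketch (my_example_iterator is a hypothetical source of (image, label) pairs):
    # sess = tf.Session()
    # threading.Thread(target=loader, args=(sess, my_example_iterator), daemon=True).start()
    # ...build the model on image_batch / label_batch and run the optimizer without feed_dict...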
As a side comment, I think tensorflow is a little more do-it-yourself than some other frameworks. But I tend to like that: it abstracts enough to be convenient, but not so much that you don't understand what's happening. "When you implement it, you understand it" is the motto that comes to mind with tensorflow.