Strange algorithm selection when using Azure AutoML with XGBoostClassifier on categorical data - xgboost

I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
I tried that with my dataset and ended up with an accuracy around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that AutoML uses SparseNormalizer and XGBoostClassifier, with reg:logistic as the objective.
So if I interpret this right, AzureML just normalizes the data (somehow from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.

TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for calibration control, you get ease-of-use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
I empathize with your concern, especially w.r.t. how LightGBM (MSFT's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns and, instead of one-hot encoding them, partitions their categories into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical approach is never utilized in AutoML.
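To make the difference between the two kinds of split concrete, here is a minimal pure-Python sketch. It is not LightGBM's actual implementation: it assumes a toy regression target, uses the sum of squared errors as the split criterion, and brute-forces the subset search. All function names are hypothetical. A one-hot style split can only isolate a single category, while a subset split may group several categories on one side.

```python
from itertools import combinations

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_cost(rows, left_categories):
    """Total SSE after sending rows whose category is in left_categories to the left."""
    left = [y for cat, y in rows if cat in left_categories]
    right = [y for cat, y in rows if cat not in left_categories]
    return sse(left) + sse(right)

def best_one_hot_split(rows):
    """One-hot style: each candidate split isolates exactly one category."""
    categories = sorted({cat for cat, _ in rows})
    candidates = [(split_cost(rows, {c}), {c}) for c in categories]
    return min(candidates, key=lambda t: t[0])

def best_subset_split(rows):
    """Subset style: brute-force search over every two-way partition of the categories."""
    categories = sorted({cat for cat, _ in rows})
    candidates = []
    for size in range(1, len(categories)):
        for subset in combinations(categories, size):
            candidates.append((split_cost(rows, set(subset)), set(subset)))
    return min(candidates, key=lambda t: t[0])

# Toy data: categories "a" and "b" have low targets, "c" and "d" have high targets.
rows = [("a", 1.0), ("a", 1.2), ("b", 0.9), ("b", 1.1),
        ("c", 5.0), ("c", 5.2), ("d", 4.9), ("d", 5.1)]

one_hot_cost, one_hot_left = best_one_hot_split(rows)
subset_cost, subset_left = best_subset_split(rows)
print("one-hot:", one_hot_left, round(one_hot_cost, 3))
print("subset: ", subset_left, round(subset_cost, 3))
```

On this toy data a single subset split separates the low-target categories from the high-target ones, which a single one-hot split cannot do. The brute-force search is exponential in the number of categories and is only for illustration; real libraries use cheaper strategies.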
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.


Online vs Offline data augmentation

I can do online, or "on the fly", image augmentation using augmentation layers or ImageDataGenerator.
Or I can do the augmentation offline and then save the resulting images to disk.
What are the advantages and disadvantages of each approach?
I am not concerned about storage.
Also, the offline approach provides a "double check" option to ensure that all augmentations were done as planned.
Maybe check out this NVIDIA post:
"Online augmentation in the training data loader is a good way to increase the variation in the dataset. However, the augmented data is generated randomly based on the distribution the data loader follows when sampling the data. In order to achieve good accuracy, the model may need to be trained for a long time. In order to circumvent this and generate a dataset with the required augmentations (Offline augmentation can be used). Offline augmentation can dramatically increase the size of the dataset when collecting and labeling data is expensive or not possible."
I think your last sentence is right on point.
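As a rough illustration of the two approaches, here is a minimal pure-Python sketch that uses nested lists as stand-in images and a horizontal flip as the only augmentation. The function names are hypothetical and no particular framework is assumed. The online version draws random variants each epoch, while the offline version materializes a fixed set that can be inspected before training.

```python
import random

def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def random_augment(image, rng):
    """Flip the image with probability 0.5; otherwise return it unchanged."""
    return horizontal_flip(image) if rng.random() < 0.5 else image

def online_batches(images, epochs, seed=0):
    """Online: augment on the fly, so each epoch may see different variants."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [random_augment(img, rng) for img in images]

def offline_dataset(images):
    """Offline: build every planned variant once, so it can be checked and reused."""
    return [variant for img in images for variant in (img, horizontal_flip(img))]

images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

for epoch, batch in enumerate(online_batches(images, epochs=2)):
    print("online epoch", epoch, batch)

stored = offline_dataset(images)
print("offline size:", len(stored))  # 2 originals + 2 flipped copies
```

With the offline set, every variant is known in advance, which is what makes the "double check" possible. With the online generator, which variants the model sees depends on the random draws and on how long training runs.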

AutoML select the model manually

I have been looking for a while for the best pipeline to do some classification using AutoML. But I want to know if it is possible to select the model manually and then just optimize its hyperparameters. For example, I want to optimize only an SVM's hyperparameters and don't care about other models.
You can optimize only the selected model in MLJAR AutoML. It is open-source AutoML with code available at GitHub:
The example code will look like:
automl = AutoML(algorithms=["Xgboost"], mode="Compete")
automl.fit(X, y)
The above code will tune only the Xgboost algorithm. The mode Compete is needed because the MLJAR AutoML can work in three modes: Explain, Perform, and Compete. Algorithms available in MLJAR AutoML: Baseline, Linear, Random Forest, Extra Trees, Decision Tree, Neural Networks, Nearest Neighbors, Xgboost, LightGBM, CatBoost.
I'm the author of MLJAR AutoML; I'll be happy to help you set it up and run it.

H2o flow automl temporary sample frame

I have a large frame and used H2O Flow to run AutoML with a deep learning algo. However, the training metrics are calculated on a “temporary sample frame”. I could not find any info on this. I am not sure whether AutoML was run on the full frame or just this temp frame. Can someone help me understand or give a pointer? BTW, I don’t find this feature convenient.
This is a special case for Deep Learning models and is not the case for any other models produced by the AutoML process. For efficiency reasons (and since H2O is designed for very large datasets), the training metrics in Deep Learning models are calculated on a subset of the original training frame.
There is a parameter in the H2O Deep Learning algorithm called score_training_samples that defaults to 10,000 rows (and since we do approximate sampling, also for efficiency reasons, it makes sense that the actual subset size is 9,993).
This should be a good approximation for training error. The only way to change this in Flow would be to train a Deep Learning model manually (outside the AutoML process).
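One way a requested 10,000-row sample can come back with a slightly different size is if each row is kept independently with probability target / total, instead of drawing an exact-size sample. Whether H2O samples exactly this way is an assumption; the sketch below (with a hypothetical function name) only illustrates why approximate sampling yields a size close to, but usually not equal to, the target.

```python
import random

def approximate_sample(n_rows, target, seed=42):
    """Keep each row index independently with probability target / n_rows."""
    rng = random.Random(seed)
    keep_probability = min(1.0, target / n_rows)
    return [i for i in range(n_rows) if rng.random() < keep_probability]

sample = approximate_sample(n_rows=1_000_000, target=10_000)
print(len(sample))  # close to 10,000, but rarely exactly 10,000
```

This avoids a second pass or a shuffle over a very large frame, at the price of a sample size that fluctuates by roughly the square root of the target.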

Reusing transformations between training and predictions

I'd like to apply stemming to my training data set. I can do this outside of tensorflow as part of training data prep, but I then need to do the same process on prediction request data before calling the (stored) model.
Is there a way of implementing this transformation in tensorflow itself so the transformation is used for both training and predictions?
This problem becomes more annoying if the transformation requires knowledge of the whole dataset, normalisation for example.
Can you easily express your processing (e.g. stemming) as a tensorflow operation? If yes, then you can build your graph in a way that both your inputs and predictions can make use of the same set of operations. Otherwise, there isn't much harm in calling the same (non tensorflow) function for both pre-processing and for predictions.
Re normalisation: you would find the dataset statistics (means, variance, etc. depending on how exactly you are normalizing) and then hardcode them for the pre/post-processing so I don't think that's really an annoying case.
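As a minimal sketch of that idea, outside of TensorFlow and with hypothetical function names: compute the statistics once on the training data, store them alongside the model, and apply the same function at prediction time. It assumes simple mean/standard-deviation normalisation of a single numeric feature.

```python
import json
import statistics

def fit_normalizer(train_values):
    """Compute dataset statistics once, on the training data only."""
    return {"mean": statistics.fmean(train_values),
            "std": statistics.pstdev(train_values)}

def normalize(values, stats):
    """Apply the stored statistics; used unchanged for training and prediction."""
    return [(v - stats["mean"]) / stats["std"] for v in values]

train = [2.0, 4.0, 6.0, 8.0]
stats = fit_normalizer(train)

# Persist the statistics next to the model so the serving code can load them.
saved = json.dumps(stats)

train_normalized = normalize(train, stats)
request_normalized = normalize([5.0], json.loads(saved))
print(stats, request_normalized)
```

The key point is that the prediction path never recomputes statistics from the request data; it only reuses what was fitted during training. A real version would also need to guard against a zero standard deviation.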

Can Tensorflow Wide and Deep model train to continuous values

I am working with the Tensorflow Wide and Deep model. It currently trains against a binary classification (>50K or not).
Can this model be coerced to train directly against numeric values to produce more precise (if less accurate) predictions?
I have seen an example of using LSTM RNNs to make such predictions using TensorFlowEstimator directly here, but DNNLinearCombinedClassifier will not accept n_classes=0.
I like the structure of the Wide and Deep model, especially the ability to run the linear regression and the DNN separately to determine how learnable the data is, but my application involves data that clusters, but in an overlapping, input-dependent fashion.
Use DNNLinearCombinedRegressor for regression problems.