I have been looking for the best pipeline to do some classification using AutoML for a while now. But I want to know if it is possible to select the model manually and then just optimize its hyperparameters. For example, I want to optimize just the SVM's hyperparameters and don't care about other models.
You can optimize only the selected model in MLJAR AutoML. It is an open-source AutoML framework with code available on GitHub: https://github.com/mljar/mljar-supervised
The example code will look like:
from supervised.automl import AutoML

automl = AutoML(algorithms=["Xgboost"], mode="Compete")
automl.fit(X, y)
The above code will tune only the Xgboost algorithm. The Compete mode is set because MLJAR AutoML can work in three modes, Explain, Perform, and Compete, and Compete performs the most extensive hyperparameter tuning. Algorithms available in MLJAR AutoML: Baseline, Linear, Random Forest, Extra Trees, Decision Tree, Neural Networks, Nearest Neighbors, Xgboost, LightGBM, CatBoost.
I'm the author of MLJAR AutoML, and I'll be happy to help you set it up and run it.
I asked a question about optimizing the trained function of a PyTorch-written deep neural network (data-driven optimization) here, but it looks like there isn't any solution for it.
In my previous effort, I trained a DL network in PyTorch, exported the output function using torch.jit.trace, and tried to optimize the trained output function with Pyomo, but it didn't work.
Now, I want to ask what alternative framework combination (DL framework + optimization framework) I can use to first train my network and then optimize the trained output function without any problems.
It should be noted that my training data are physical properties (such as temperature, pressure, etc.), so I am facing a regression problem.
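For reference, a minimal sketch of the export step I describe above (the network architecture and file name here are just placeholders, not my actual model):

import torch
import torch.nn as nn

# Placeholder regression network standing in for the trained model
model = nn.Sequential(
    nn.Linear(3, 64),   # e.g. temperature, pressure, and one more property
    nn.ReLU(),
    nn.Linear(64, 1),
)
model.eval()

# Trace the trained network with a representative input so it can be
# exported and evaluated outside the training script
example_input = torch.randn(1, 3)
traced = torch.jit.trace(model, example_input)
traced.save("trained_function.pt")

# The traced function can later be reloaded and called like a black box
loaded = torch.jit.load("trained_function.pt")
print(loaded(example_input))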
I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features into binary columns (using LabelEncoder and OneHotEncoder) and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
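For reference, a rough sketch of that manual pipeline (the data is made up since I can't share mine, and pd.get_dummies is used as a shortcut for the LabelEncoder/OneHotEncoder step):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Made-up categorical data standing in for the confidential dataset
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": ["S", "M", "L", "S", "L", "M"],
    "label": ["cat_a", "cat_b", "cat_c", "cat_a", "cat_b", "cat_c"],
})

# One-hot encode the categorical features into binary columns
X = pd.get_dummies(df[["color", "size"]])

# Encode the categorical label into integer classes
y = LabelEncoder().fit_transform(df["label"])

# Multiclass softmax objective, as described above
clf = XGBClassifier(objective="multi:softmax")
clf.fit(X, y)
print(clf.predict(X))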
I tried that with my dataset and ended up with an accuracy of around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy of around 0.85 in the best experiment. But what is really interesting is that AutoML uses SparseNormalizer and XGBoostClassifier with reg:logistic as the objective.
So if I interpret this right, Azure AutoML just normalizes the data (somehow, from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for calibration control, you get ease-of-use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary()
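For example, a sketch against the Azure ML SDK v1 AutoML API (the run object is a placeholder, and the exact step name can vary by SDK version):

# Assumes `automl_run` is a completed AutoMLRun from the Azure ML SDK (v1)
best_run, fitted_model = automl_run.get_output()

# The fitted model behaves like a sklearn Pipeline, so the pre-processing
# and estimator steps can be listed directly
print(fitted_model.named_steps)

# Summary of how each input column was featurized; the "datatransformer"
# step name is the usual one for non-timeseries tasks
print(fitted_model.named_steps["datatransformer"].get_featurization_summary())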
I empathize with your concern, especially w.r.t. how LightGBM (Microsoft's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns natively and, instead of one-hot encoding them, bins them into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical handling is never utilized in AutoML.
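For contrast, a minimal sketch of LightGBM's native categorical handling outside AutoML (made-up data; columns with the pandas category dtype are picked up automatically, so no one-hot encoding is needed):

import pandas as pd
import lightgbm as lgb

# Made-up categorical dataset; LightGBM treats `category` dtype columns natively
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "green", "blue", "red", "green"]),
    "size": pd.Categorical(["S", "M", "L", "S", "L", "M"]),
})
y = [0, 1, 0, 1, 0, 1]

# No one-hot encoding: LightGBM finds the best split by partitioning the
# categories into two subsets at each node
clf = lgb.LGBMClassifier(n_estimators=10, min_child_samples=1)
clf.fit(X, y)
print(clf.predict(X))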
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.
I followed all the steps mentioned in the article:
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Then I compared the results with linear regression and found that its error (68) is lower than that of the TensorFlow model (84).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
pred = lin_reg.predict(X_test)

# RMSE on the test set
print(np.sqrt(mean_squared_error(y_test, pred)))
Does this mean that if I have a large dataset, I will get better results than linear regression?
What is the best situation in which I should be using TensorFlow?
Answering your first question: neural networks are notorious for overfitting on smaller datasets, and here you are comparing the performance of a simple linear regression model against a neural network with two hidden layers on the test set. So it's not very surprising to see the MLP model fall behind the linear regression model (assuming you are working with a relatively small dataset). Larger datasets will definitely help a neural network learn more accurate parameters and generalize better.
Coming to your second question: TensorFlow is basically a library for building deep learning models. Whenever you work on a deep learning problem like image recognition or natural language processing, you need massive computational power and will be processing a ton of data to train your models, and this is where TensorFlow comes in handy: it offers GPU support, which significantly speeds up training that would otherwise be practically impossible. Moreover, if you are building a product that has to be deployed in a production environment, you can make use of TensorFlow Serving, which helps you take your models much closer to your customers.
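For comparison, a minimal sketch of a Keras MLP for the same regression setup (layer sizes and epochs are arbitrary; it assumes the X_train/y_train and X_test/y_test arrays from the question):

import numpy as np
import tensorflow as tf
from sklearn.metrics import mean_squared_error

# Small MLP with two hidden layers, similar in spirit to the tutorial's model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),  # single output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# RMSE on the test set, directly comparable to the linear regression number
pred = model.predict(X_test).ravel()
print(np.sqrt(mean_squared_error(y_test, pred)))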
I have a large frame and used H2O Flow to run AutoML with a deep learning algorithm. However, the training metrics are calculated on a "temporary sample frame". I could not find any info about this. I am not sure if the AutoML has been run on the full frame or just this temporary frame. Can someone help me understand, or give a pointer? BTW, I don't find this feature convenient.
This is a special case for Deep Learning models and is not the case for any other models produced by the AutoML process. For efficiency reasons (and since H2O is designed for very large datasets), the training metrics in Deep Learning models are calculated on a subset of the original training frame.
There is a parameter in the H2O Deep Learning algorithm called score_training_samples that defaults to 10,000 rows (and since we do approximate sampling, also for efficiency reasons, it makes sense that the actual subset size is 9,993).
This should be a good approximation for training error. The only way to change this in Flow would be to train a Deep Learning model manually (outside the AutoML process).
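If you do go the manual route, a minimal sketch of overriding that parameter in the H2O Python API (the file and column layout are placeholders):

import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()

# Placeholder frame; replace with your own large frame
train = h2o.import_file("your_training_data.csv")

# score_training_samples=0 scores on the full training frame instead of the
# default ~10,000-row sample (slower, but exact training metrics)
dl = H2ODeepLearningEstimator(score_training_samples=0)
dl.train(x=train.columns[:-1], y=train.columns[-1], training_frame=train)
print(dl.model_performance(train=True))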
I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT-2 for text classification tasks. However, I'm not sure which one I should pick to start with. Which of these recent NLP models, such as the original Transformer, BERT, GPT-2, or XLNet, would you use to start with, and why? I'd rather implement it in TensorFlow, but I'm flexible about going with PyTorch too.
Thanks!
It highly depends on your dataset, and it is part of the data scientist's job to find which model is more suitable for a particular task in terms of the selected performance metric, training cost, model complexity, etc.
When you work on the problem, you will probably test all of the above models and compare them. Which one of them should you choose first? Andrew Ng in "Machine Learning Yearning" suggests starting with a simple model so you can quickly iterate and test your ideas, data preprocessing pipeline, etc.:
Don't start off trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps in just a few days.
Following this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas, and then move on to more complex models and see how they can improve your results.
Note that modern NLP models contain a large number of parameters, and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download a pre-trained model, use it as a basis, and fine-tune it on your task-specific dataset to achieve better performance and reduce training time.
I agree with Max's answer, but if the constraint is to use a state-of-the-art large pretrained model, there is a really easy way to do this: the library by HuggingFace called pytorch-transformers. Whether you choose BERT, XLNet, or whatever, they're easy to swap out. There is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
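As an illustration, a minimal sketch using the current transformers package (the successor of pytorch-transformers); the checkpoint name and example sentence are arbitrary, and in practice you would fine-tune on your labeled sentences before predicting:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any pretrained encoder works here; BERT and XLNet checkpoints are interchangeable
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a single sentence and get the (not yet fine-tuned) classification logits
inputs = tokenizer("This is an example sentence.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted label: 0 or 1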
Well, like others mentioned, it depends on the dataset; multiple models should be tried and the best one chosen.
However, sharing my experience, XLNet has beaten all other models so far by a good margin. Hence, if learning is not the objective, I would simply start with XLNet, then try a few more down the line and conclude. It just saves time in exploring.
The repo below is excellent for doing all of this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses Hugging Face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT-2 for summarization tasks (English only). Based on my experience, GPT-2 works best of the three on short, paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.