I'm trying to train a Tensorflow model with a very large dataset (much larger than my memory).
To fully utilize all the available training data, I'm thinking about separating them into several small "shards" and train on one shard at a time.
After a little research, I found this method is often referred to as "incremental learning". And based on this Wiki page, NOT all algorithms support incremental learning.
I'm building my model using tf.keras.Model. In this case, is incremental learning possible?
Tensorflow & Keras models support incremental learning by default - in fact, we routinely use incremental learning in cases of transfer learning, among others. You just fit different portions of your data to the model sequentially - you can even save your model, and then load it and continue training with the same or different parts of your data.
For more info see:
Incremental learning in keras (SO thread)
Transfer learning & fine-tuning (Keras dosumentation)
Online/Incremental Learning with Keras and Creme
Can I train a neural network incrementally given new daily data? (AI SE thread)
Related
Objective: I am trying to train a Tensorflow Huggingface GPT2 model (language model training from scratch)
Model Description:
Huggingface GPT2 Tensorflow Model
Attached a pic of config. Model Config
Dataset Description:
I have a large dataset (~20GB),
the data is separated into multiple text files with each new line as a training example.
I am facing two issues.
The examples are of different length and I am not sure how to make all the example sizes of same length to feed to the model.
Solutions Tried: We can either pad them, but then I am not sure how to do that in batches in Tensorflow. I searched about data-collator
Doubt: Padding would have to be done to make all the examples of equal size in the batch or across the whole dataset. And would this be with tokens or some other information. (Different Data Collators for Language Modelling etc.)
Since the data is very large, it cannot be loaded in memory at once while training. (Doing model.fit). For that I am not sure how to proceed.
Solutions: I am thinking of training and saving the model on small files but that would require manual intervention or for looping and the model would not be trained on the whole dataset in one go, so if there are other alternatives. Help would be really appreciated.
for my ML project I want to use the faster_rcnn_resnet101_kitti model from tensorflow model zoo. As the number of images in the Kitti dataset is extremely small (about 7000 images) for a deep learning practice, I was wondering how this small amount of data leads to the decent inference performance (mAP#0.5=87)? One answer I can imagine is that the network was first trained on a different, rich dataset and fine tuned on the Kitti but I am not sure about it.
I am wondering how can I find out the exact underlying training procedure (apart from pipeline.config) for the models published on TF model zoo?
Thanks
I followed all the steps mentioned in the article:
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Then I compared the results with Linear Regression and found that the error is less (68) than the tensorflow model (84).
from sklearn.linear_model import LinearRegression
logreg_clf = LinearRegression()
logreg_clf.fit(X_train, y_train)
pred = logreg_clf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred)))
Does this mean that if I have large dataset, I will get better results than linear regression?
What is the best situation - when I should be using tensorflow?
Answering your first question, Neural Networks are notoriously known for overfitting on smaller datasets, and here you are comparing the performance of a simple linear regression model with a neural network with two hidden layers on the testing data set, so it's not very surprising to see that the MLP model falling behind (assuming that you are working with relatively a smaller dataset) the linear regression model. Larger datasets will definitely help neural networks in learning more accurate parameters and generalize the phenomena well.
Now coming to your second question, Tensorflow is basically a library for building deep learning models, so whenever you are working on a deep learning problem like image recognition, Natural Language Processing, etc. you need massive computational power and will be processing a ton of data to train your models, and this is where TensorFlow becomes handy, it offers you GPU support which will significantly boost your training process which otherwise becomes practically impossible. Moreover, if you are building a product that has to be deployed in a production environment for it to be consumed, you can make use of TensorFlow Serving which helps you to take your models much closer to the customers.
I have a large frame and used h2o flow run automl with a deep learning algo. However, the training metrics are calculated on a “temporary sample frame”. I could not find any info to this. I am not sure if the automl has been run on the full frame or just thus temp frame. Can someone help to understand or give a pointer? BTW, I don’t find this feature convenient.
This is a special case for Deep Learning models and is not the case for any other models produced by the AutoML process. For efficiency reasons (and since H2O is designed for very large datasets), the training metrics in Deep Learning models are calculated on a subset of the original training frame.
There is a parameter in the H2O Deep Learning algorithm called score_training_samples that defaults to 10,000 rows (and since we do approximate sampling, also for efficiency reasons, it makes sense that the actual subset size is 9,993).
This should be a good approximation for training error. The only way to change this in Flow would be to train a Deep Learning model manually (outside the AutoML process).
I have a question related to this one:
TensorFlow in production for real time predictions in high traffic app - how to use?
I want to setup TensorFlow Serving to do inference as a service for our other application. I see how TensorFlow Serving helps me to do that. Additionally, it mentions a continuous training pipeline, which probably is related to the possibility that TensorFlow Serving can serve with multiple versions of a trained model. But what I am not sure is how to retrain your model as you get new data. The other post mentions the idea to run retraining with cron jobs. However, I am not sure if automatic retraining is a good idea. What architecture would you propose for a continuous retraining pipeline with a system continuously facing new, labelled data?
Edit: It is a supervised learning case. The question is would you automatically retrain your model after n new datapoints came in or would you retrain during the downtime of the customer automatically or just retrain manually?
You probably want to use some kind of semi-supervised training. There's fairly extensive research in that area.
A crude, but expedient way, which works well, is to use the current best models that you have to label the new, incoming data. Models are typically able to produce a score (hopefully a logprob). You can use that score to only train on the data that fits well.
That is an approach that we have used in speech recognition and is an excellent baseline.