Issues with padding (pre-processing) for the Hugging Face GPT-2 transformer model and with a very large dataset during model training - TensorFlow

Objective: I am trying to train a TensorFlow Hugging Face GPT-2 model (language model training from scratch).
Model Description:
Hugging Face GPT-2 TensorFlow model.
The model config is attached as an image: Model Config
Dataset Description:
I have a large dataset (~20 GB); the data is split across multiple text files, with each new line being a training example.
I am facing two issues.
The examples are of different lengths, and I am not sure how to make all the examples the same length so they can be fed to the model.
Solutions Tried: We can pad them, but I am not sure how to do that in batches in TensorFlow. I have searched for information about data collators.
Doubt: Should padding be done so that all examples are the same size within each batch, or across the whole dataset? And should the padding use tokens or some other information? (There are different data collators for language modelling, etc.)
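For the per-batch option, something like the following (a minimal sketch, assuming the Hugging Face GPT2TokenizerFast and DataCollatorForLanguageModeling; the example strings are made up) pads each batch only to the length of its longest example:

```python
from transformers import GPT2TokenizerFast, DataCollatorForLanguageModeling

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# mlm=False produces causal language-modelling labels (pad positions get -100)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, return_tensors="tf"
)

examples = ["a short line", "a noticeably longer line of training text"]
encoded = [tokenizer(text) for text in examples]
batch = collator(encoded)          # pads only to the longest example in this batch
print(batch["input_ids"].shape)    # (2, length_of_longest_example_in_batch)
```

With this dynamic approach the padding length varies from batch to batch, so there is no need to pad across the whole dataset.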
Since the data is very large, it cannot be loaded into memory at once while training (when calling model.fit). I am not sure how to proceed with that.
Solutions: I am thinking of training and saving the model on small files, but that would require manual intervention or a for loop, and the model would not be trained on the whole dataset in one go, so I would like to know if there are other alternatives. Help would be really appreciated.
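One alternative to manually looping over small files (a minimal sketch, assuming the files match data/*.txt, a block size of 128 and fixed-length padding; adjust to your setup) is to let tf.data stream the text files line by line, so model.fit only ever materializes a few batches at a time:

```python
import tensorflow as tf
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
BLOCK_SIZE = 128  # assumed maximum sequence length

def encode(line):
    def _tokenize(text):
        enc = tokenizer(
            text.numpy().decode("utf-8"),
            truncation=True,
            max_length=BLOCK_SIZE,
            padding="max_length",
        )
        return enc["input_ids"], enc["attention_mask"]

    input_ids, attention_mask = tf.py_function(
        _tokenize, inp=[line], Tout=(tf.int32, tf.int32)
    )
    input_ids.set_shape([BLOCK_SIZE])
    attention_mask.set_shape([BLOCK_SIZE])
    # For causal language modelling the labels are the input ids themselves.
    return {"input_ids": input_ids, "attention_mask": attention_mask}, input_ids

files = tf.data.Dataset.list_files("data/*.txt")   # assumed location of the text files
dataset = (
    tf.data.TextLineDataset(files)                 # reads lazily, one example per line
    .map(encode, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, epochs=1)   # e.g. a compiled TFGPT2LMHeadModel
```

Because the pipeline is lazy, the 20 GB corpus is never loaded at once and the whole dataset is still covered in a single model.fit call.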

Related

How to control the amount of the augmented training data with the augmentation layers integrated into the model

As you may know, recent versions of TensorFlow/Keras allow data augmentation layers to be integrated into the model. This feature of the API is an excellent option, especially when you want to apply image augmentation to only part of the inputs (the images) of a model with multimodal inputs and different sub-networks for different inputs. In my case, the test accuracy with this augmentation increased by 3-5% compared with no augmentation.
But I can't figure out how many training samples are used in the actual training with this augmentation method. For simplicity, let's assume I am passing a list of NumPy arrays as the inputs when fitting the model. For example, if I have 1000 training cases for a model with the augmentation layers, will 1000 training cases with transformed images be used in training? If not, how many?
I tried in vain to search all related sites (tutorials and documentation) for an answer to this simple question.
I think I found the answer. Based on the model's training log, the augmentation layers do not produce additional images; they randomly transform the original images. To increase the amount of augmented data, a user has to provide multiple copies of the original training data as input to the model.
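To make that concrete, here is a small sketch (the layer choices, input shape and random data below are assumptions, not taken from the question) showing that the augmentation layers transform each batch in place rather than adding samples:

```python
import tensorflow as tf

# Random augmentation layers placed at the front of the model
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    augment,                                   # only active when training=True
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = tf.random.uniform((1000, 32, 32, 3))                  # 1000 dummy images
y = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=100, epochs=2, verbose=2)
# The log shows 10 steps per epoch (1000 samples / batch size 100): no extra
# samples are created, but each epoch sees differently transformed images.
```

So with 1000 training cases, each epoch still uses exactly 1000 (randomly transformed) images.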

How to create a dataset for image classification

I trained a model using images I gathered from the web. Then, when inferences were made using images newly collected from the web, performance was poor.
I am wondering how I can improve my dataset using misclassified images. Can I add all the misclassified images to the training dataset? And then do I have to collect new images?
[Edit]
I added some of the misclassified images to the training dataset, and the performance evaluation got better.
It might be worth providing more info on how you trained your model and on your network architecture.
However, here are some general guidelines:
You can try to diversify the images in your training set by, yes, adding new images. The more varied examples you provide to your network, the higher the chance that they will be similar to the images you want predictions for.
Do data augmentation; it is pretty straightforward and usually improves the accuracy quite a bit. You can have a look at this TensorFlow tutorial on data augmentation. If you don't know what data augmentation is, it is basically a technique for making minor changes to your images, such as rotating them a bit, resizing them, etc. This way the model learns your images even with slight changes, which usually makes it more robust to new images.
You could consider doing transfer learning. The main idea here is to leverage a model that has learned from a huge dataset and fine-tune it for your specific problem. The tutorial I linked shows the typical transfer-learning workflow: taking a model pretrained on the ImageNet dataset (the huge dataset) and retraining it on the Kaggle "cats vs dogs" classification dataset (a smaller dataset, like the one you might have).
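For the transfer-learning suggestion, a minimal sketch (the MobileNetV2 backbone, 160x160 input size and single sigmoid output below are assumptions for illustration) could look like this:

```python
import tensorflow as tf

# Backbone pretrained on ImageNet, without its classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pretrained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary task, e.g. cats vs dogs
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```

Freezing the backbone means only the small head is trained on your data; once it converges you can optionally unfreeze some of the top layers and fine-tune with a lower learning rate.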

Variation in total loss while training the Faster RCNN model using customized data

I am working on an object detection model to identify two classes. I am using Faster R-CNN on a customized dataset in the TensorFlow Object Detection API. The dataset contains 20k images (augmented) with two classes. While training the model, the loss does not decrease properly as it reaches 100k steps; it has a lot of variation, as shown in the image. Can someone tell me where I am making a mistake?
[Training loss plot]

TensorFlow Kitti-trained models: Detailed underlying training procedure

For my ML project I want to use the faster_rcnn_resnet101_kitti model from the TensorFlow model zoo. As the number of images in the Kitti dataset is extremely small (about 7000 images) for deep learning practice, I was wondering how this small amount of data leads to decent inference performance (mAP@0.5=87). One answer I can imagine is that the network was first trained on a different, richer dataset and then fine-tuned on Kitti, but I am not sure about it.
I am wondering how I can find out the exact underlying training procedure (apart from pipeline.config) for the models published in the TF model zoo?
Thanks

NER Incremental training with Spacy

I would like to incrementally train a NER Spacy Model.
By incrementally, I mean: send a first batch of N training samples and get a first model, then send a second batch of M training samples and get a model identical to one trained on all N+M samples in a single batch.
To be clear, this is not about adding samples after the model has been fully trained. Instead, it is the ability to save intermediate states of the model so we can "resume" training and add more samples.
This is very useful if the number of samples is large or for building an "active learning" system.
It seems doable with NLTK according to this article, and I was wondering if the same can be done with spaCy.
So far I have trained my own custom NER model with spaCy using nlp.update, but it does not seem to store any intermediate state that supports incremental training.
Yes, this is possible in spaCy. Your approach with nlp.update is correct; once you have added your second batch of training samples, you just need to make a call to nlp.to_disk("/path") (https://spacy.io/usage/saving-loading). Then you can continue this process by loading your saved model again.
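A minimal sketch of that save/reload loop (spaCy v3 API; the label, example sentences and paths are made up for illustration, and the exact calls differ slightly in spaCy v2):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT")

batch_1 = [("I bought an iPhone", {"entities": [(12, 18, "PRODUCT")]})]
batch_2 = [("She uses a MacBook", {"entities": [(11, 18, "PRODUCT")]})]

def update(nlp, batch, optimizer):
    for text, annotations in batch:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

# First batch of N samples, then save the intermediate state
optimizer = nlp.initialize()
update(nlp, batch_1, optimizer)
nlp.to_disk("model_after_batch_1")

# Later: reload and continue with the second batch of M samples
nlp = spacy.load("model_after_batch_1")
optimizer = nlp.resume_training()   # keeps the existing weights
update(nlp, batch_2, optimizer)
nlp.to_disk("model_after_batch_2")
```

nlp.resume_training() keeps the weights learned so far, whereas calling nlp.initialize() again would reset them.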