train loss jump when using tfrecord file - tensorflow

My goal is to train a model on large datasets, but my PC's RAM is limited.
Here I use a TFRecord file to save RAM during training, but I find that the loss values of the model trained from the TFRecord file show periodic jumps,
while the loss of the model trained from the numpy pickle stays steady.
It should be noted that the model and data used in the two methods are the same.

Related

Image decoded/encoded/decoded in Tensorflow is not the same as the original

I store images in TFRecord files for training an image classification model with TensorFlow 2.10. The TFRecords are read into a dataset, on which I apply the fit() function. After training I'm making an inference with
the image from the dataset
the same image read from disk
I notice that the predictions are not the same, because in the first case the image goes through (in the process of writing and then reading the TFRecord file to build the dataset) a decode/encode/decode transformation (the TF functions tf.io.decode_jpeg and tf.io.encode_jpeg) that is not symmetrical: the image after the transformation is not the same as the original image (even if I encode with quality=100).
It can make a difference: in the first case, the correct class is not in the top-3; in the second case, it is.
Is there any way to avoid this asymmetrical behavior?
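For reference, a minimal sketch of the round trip described above (the image path is hypothetical); it only shows that decode -> re-encode -> decode is not an identity for JPEG, which is why the dataset image and the image read from disk can differ:

```python
import tensorflow as tf

# Round trip described in the question: decode -> re-encode -> decode.
# JPEG is lossy, so this is not an identity, even at quality=100.
raw = tf.io.read_file("example.jpg")                 # hypothetical image path
img = tf.io.decode_jpeg(raw, channels=3)             # first decode
reencoded = tf.io.encode_jpeg(img, quality=100)      # re-encode when writing the TFRecord
img2 = tf.io.decode_jpeg(reencoded, channels=3)      # decode again when reading the dataset

diff = tf.reduce_max(tf.abs(tf.cast(img, tf.int32) - tf.cast(img2, tf.int32)))
print("max pixel difference after re-encoding:", diff.numpy())  # typically > 0
```

One way to avoid the re-encode step entirely is to write the original encoded bytes (the result of tf.io.read_file) into the TFRecord as a bytes feature and decode only once when reading the dataset.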

Issues in padding(pre-processing) of huggingface gpt2 transformer model and issues with very large dataset during model training

Objective: I am trying to train a TensorFlow Hugging Face GPT-2 model (language model training from scratch).
Model Description:
Hugging Face GPT-2 TensorFlow model.
A picture of the config is attached. Model Config
Dataset Description:
I have a large dataset (~20 GB);
the data is split across multiple text files, with each new line being a training example.
I am facing two issues.
The examples are of different lengths and I am not sure how to make all examples the same length to feed to the model.
Solutions tried: we can pad them, but then I am not sure how to do that in batches in TensorFlow. I searched about data collators.
Doubt: should padding be done to make all examples in the batch the same size, or across the whole dataset? And would this be done with tokens or some other information? (Different data collators for language modelling, etc.)
Since the data is very large, it cannot all be loaded into memory at once while training (doing model.fit). I am not sure how to proceed with that.
Solutions: I am thinking of training and saving the model on small files, but that would require manual intervention or a for loop, and the model would not be trained on the whole dataset in one go, so I'm looking for other alternatives. Help would be really appreciated.
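Not an authoritative answer, just a minimal sketch of the kind of tf.data pipeline that addresses both points, under some assumptions: a Hugging Face GPT-2 tokenizer, text files matching data/*.txt (an illustrative path), a max length of 512, and padding each batch only to the longest example in that batch. Files are streamed line by line, so nothing is loaded into memory at once.

```python
import tensorflow as tf
from transformers import GPT2TokenizerFast

# Illustrative setup: GPT-2 tokenizer, EOS reused as the pad token.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Stream the text files line by line instead of loading them into memory.
files = tf.data.Dataset.list_files("data/*.txt")      # hypothetical path
lines = files.interleave(tf.data.TextLineDataset, cycle_length=4)

def encode(line):
    def _tokenize(text):
        ids = tokenizer(text.numpy().decode("utf-8"),
                        truncation=True, max_length=512)["input_ids"]
        return tf.constant(ids, dtype=tf.int32)
    ids = tf.py_function(_tokenize, [line], tf.int32)
    ids.set_shape([None])
    return ids

dataset = (lines
           .map(encode, num_parallel_calls=tf.data.AUTOTUNE)
           # Pads each batch to the longest example in that batch only.
           .padded_batch(8, padded_shapes=[None],
                         padding_values=tf.constant(tokenizer.pad_token_id,
                                                    dtype=tf.int32))
           .prefetch(tf.data.AUTOTUNE))
```

This dataset can then be passed to model.fit directly, so the whole corpus is consumed in one training run without manual chunking.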

How to train large datasets in Colab free

I have to train on 70,000 images for my face verification project on Google Colab free.
First it gets stuck on the 1st epoch, and then even if it starts training, after some time it throws an out-of-RAM error.
The code I use is:
<https://nbviewer.org/github/nicknochnack/FaceRecognition/blob/main/Facial%20Verification%20with%20a%20Siamese%20Network%20-%20Final.ipynb>
If I have to make mini-batches of my dataset to fit it in Colab's GPU memory, how can I do that?
Also, I want to train on the whole dataset because it contains the images of 5 different people as anchors and positives.
You can do the following to train larger datasets:
Add more pooling layers to the model.
Lower the input size of your model.
Use a binary format for the images, with a lower image size, for image classification models.
Lower the batch size while training and validating your model.
You can also use the tf.data API for operations like batching, slicing, preprocessing and shuffling to build a data pipeline (see the sketch below), and you can constrain GPU usage further to avoid out-of-memory issues.
A sample Colab notebook is attached: https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb
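As a rough illustration (the paths, image size and batch size are made up, not taken from the linked notebook), a tf.data pipeline along these lines loads and batches images on the fly instead of holding all 70,000 in RAM:

```python
import tensorflow as tf

IMG_SIZE = (105, 105)   # illustrative input size
BATCH = 16              # small batch to stay within Colab GPU memory

def load_image(path):
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.resize(img, IMG_SIZE) / 255.0
    return img

# Hypothetical layout: one folder per person, JPEG images inside.
paths = tf.data.Dataset.list_files("data/*/*.jpg", shuffle=True)
dataset = (paths
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(BATCH)                   # images are loaded batch by batch,
           .prefetch(tf.data.AUTOTUNE))    # never the whole dataset at once
```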

CNN model for deployment: how to optimize

It's my first time deploying a model. I've created a CNN model using TensorFlow, Keras and Xception, and the saved model is about 80 MB. When I load it from a function and do a prediction, it takes about 4-5 seconds. Is there a way to reduce this time? Does the model have to be loaded for every prediction?
Load the model only once in your program and reuse the loaded model for each prediction; TensorFlow does not reload the model for every prediction, although the prediction itself may still take some time. A better approach is to save only the weights after training and, for inference, recreate the model architecture and then load the saved weights.
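A minimal sketch of what this answer describes, with hypothetical file names and a hypothetical build_model() that recreates the architecture:

```python
import tensorflow as tf

# Load the model once, at startup, not inside the prediction function.
model = tf.keras.models.load_model("xception_model.keras")  # hypothetical path

def predict(image_batch):
    # Each call reuses the already-loaded model, so only inference time is paid.
    return model.predict(image_batch)

# Alternative from the answer: save only the weights after training,
# then at serving time rebuild the architecture and load the weights.
# model.save_weights("xception.weights.h5")
# serving_model = build_model()   # hypothetical: recreates the same architecture
# serving_model.load_weights("xception.weights.h5")
```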

model saver meta file is huge, can I configure tensorflow to not write it?

I am doing transfer learning in TensorFlow with VGG16. I am only training one small layer on top of the 500 MB of weights that I got from the numpy npz file on that website. When I save my model, I specify just the weights I am training, so the model file is small; however, the meta file is 500 MB. From reading the Stack Overflow question what-is-the-tensorflow-checkpoint-meta-file, it sounds safe to remove the meta files, but can I configure TensorFlow to not write them?
You could try tf.train.Saver.save with the optional write_meta_graph=False parameter.
https://www.tensorflow.org/versions/r0.11/api_docs/python/state_ops.html#Saver
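A minimal TF 1.x-style sketch of that option (the variable, shapes and paths are illustrative, not from the question):

```python
import tensorflow as tf  # TF 1.x API (tf.compat.v1 in TF 2)

# Only the small layer being trained is passed to the Saver, keeping the checkpoint small.
top_layer_w = tf.Variable(tf.zeros([4096, 10]), name="top/w")
saver = tf.train.Saver(var_list=[top_layer_w])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # write_meta_graph=False skips writing the large .meta graph file.
    saver.save(sess, "checkpoints/model.ckpt", write_meta_graph=False)
```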