mnist and cifar10 examples with TFRecord train/test file - tensorflow

I am a new user of Tensorflow. I would like to use it for training a dataset of 2M images. I did this experiment in caffe using lmdb file format.
After reading Tensorflow related posts, I realized that TFRecord is the most suitable file format to do so. Therefore, I am looking for complete CNN examples which use TFRecord data. I noticed that the image related tutorials (mnist and cifar10 in link1 and link2) are provided with a different binary file format where the entire data-set is loaded at once. Therefore, I would like to know if anyone knows if these tutorials (mnist and cifar10) are available using TFRecord data (for both CPU and GPU).

I assume that you want to both, write and read TFRecord files. I think what is done here reading_data.py should help you converting MNIST data into TFRecors.
For reading it back, this script does the trick: fully_connected_reader.py
This could be done similarly with cifar10.

Related

How to train custom object detection with tfrecord file

here I want to train a object detection model, so I have annotated the data using roboflow and then exported it as tfrecords and also got the (.pbtxt file) and after that I don't have any clue on how to train a can model from scratch with just 2,3 number of hidden layers. am not getting on how to use that tfrecord to fit in my model which I have created. please help me out.
tfrecord files are usually used with Tensorflow Object Detection. It's pretty old and I haven't seen it used in practice recently, but there's a Tensorflow Object Detection tutorial here that uses these tfrecord files.
If there's not a particular reason you need to use TF Object Detection I'd recommend using a newer and more well-supported model like YOLOv5 or YOLOv7.

Issues in padding(pre-processing) of huggingface gpt2 transformer model and issues with very large dataset during model training

Objective: I am trying to train a Tensorflow Huggingface GPT2 model (language model training from scratch)
Model Description:
Huggingface GPT2 Tensorflow Model
Attached a pic of config. Model Config
Dataset Description:
I have a large dataset (~20GB),
the data is separated into multiple text files with each new line as a training example.
I am facing two issues.
The examples are of different length and I am not sure how to make all the example sizes of same length to feed to the model.
Solutions Tried: We can either pad them, but then I am not sure how to do that in batches in Tensorflow. I searched about data-collator
Doubt: Padding would have to be done to make all the examples of equal size in the batch or across the whole dataset. And would this be with tokens or some other information. (Different Data Collators for Language Modelling etc.)
Since the data is very large, it cannot be loaded in memory at once while training. (Doing model.fit). For that I am not sure how to proceed.
Solutions: I am thinking of training and saving the model on small files but that would require manual intervention or for looping and the model would not be trained on the whole dataset in one go, so if there are other alternatives. Help would be really appreciated.

Huge size of TF records file to store on Google Cloud

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here COCO dataset is downloaded and first its features are extracted using InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Withe TPU strategy scope, I created the InceptionV3 model using keras library and loaded model with ImageNet weights as per existing code.
Now, since TPU needs data to be stored on Google Cloud storage, I created a tf records file using tf.Example with the help of this link.
Now, I tried to create this file in several ways so that it will have the data that TPU will find through TFRecordDataset.
At first I directly added image data and image path to the file and uploaded it to GCP bucket but while reading this data, I realized that this image data is not useful as it does not contain shape/size information which it will need and I had not resized it to the required dimension before storage. This file size became 2.5GB which was okay.
Then I thought lets only keep image path at cloud, so I created another tf records file with only image path, then I thought that this may not be an optimized code as TPU will have to open the image individually resize it to 299,299 and then feed to model and it will be better if I have image data through .map() function inside TFRecordDataset, so I again tried, this time by using this link, by storing R, G and B along with image path inside tf records file.
However, now I see that the size of tf records file is abnormally large, like some 40-45GB and ultimately, I stopped the execution as my memory was getting filled up on Google Colab TPU.
The original size of COCO dataset is not that large. It almost like 13GB.. and from that the dataset is being created with only first 30,000 records. so 40GB looks weird number.
May I know what is the problem with this way of feature storage? Is there any better way to store image data in TF records file and then extract through TFRecordDataset.
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression, they represent data as protobufs so it can be optimally loaded into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!

using tf.data or queues CNN text classification by tensorflow

I used CNN-Text Classification base on this github link https://github.com/dennybritz/cnn-text-classification-tf, While my dataset is too large with 10000 documents(Size: 120M).
For Efficient performance, I want to change the evaluation set to use a smaller subset of my data, or use Tensorflow queues or tf.data to read data sequentially. Now I don't know how can I solve this issue? and witch .py project in this package has to be changed?
Thanks.

How to feed the training examples one by one in tensorflow (e.g., cifar example)

Just want to feed the data according to the original order in tensorflow. Specifically, I am working on the cifar10 example. Do anyone knows how to guarantee that examples are fed exactly the same order as they are in the data file?