In the explanation I see that a TFRecord file contains multiple classes and multiple images (a cat and a bridge). When it was written, both images were written into one TFRecord file. During the read-back, it is verified that this TFRecord file contains two images.
Elsewhere I have seen people generate one TFRecord file per image. I know you can load multiple TFRecord files like this:
train_dataset = tf.data.TFRecordDataset("<Path>/*.tfrecord")
But which way is recommended? Should I build one TFRecord file per image, or one TFRecord file for multiple images? And if I put multiple images into one TFRecord file, how many is the maximum?
As you said, it is possible to save an arbitrary number of entries in a single TFRecord file, and one can create as many TFRecord files as desired.
I would recommend using practical considerations to decide how to proceed:
On the one hand, try to use fewer TFRecord files, so the files are easier to handle and move around in the filesystem
On the other hand, avoid growing TFRecord files to a size that becomes a problem for the filesystem
Keep in mind that it is useful to keep separate TFRecord files for the train / validation / test split
Sometimes the nature of the dataset makes it obvious how to split it into separate files (for example, I have a video dataset where I use one TFRecord file per participant session)
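For illustration, here is a minimal writing sketch; the number of shards, the file names and the serialize_example helper are all hypothetical, as is the `samples` iterable of (image_bytes, label) pairs:

import tensorflow as tf

NUM_SHARDS = 4  # illustrative choice; pick based on dataset size and file-size limits

def serialize_example(image_bytes, label):
    # Hypothetical helper: pack one image/label pair into a tf.train.Example.
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# `samples` is assumed to be an iterable of (image_bytes, label) pairs.
writers = [tf.io.TFRecordWriter(f"train-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
           for i in range(NUM_SHARDS)]
for idx, (image_bytes, label) in enumerate(samples):
    writers[idx % NUM_SHARDS].write(serialize_example(image_bytes, label))
for w in writers:
    w.close()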
I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here the COCO dataset is downloaded and its features are first extracted using the InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
With the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights, as in the existing code.
Now, since a TPU needs the data to be stored on Google Cloud Storage, I created a TFRecord file using tf.Example with the help of this link.
I then tried to create this file in several ways so that it would contain the data that the TPU will read through TFRecordDataset.
At first I directly added the image data and the image path to the file and uploaded it to the GCP bucket, but while reading this data back I realized that the image data was not useful: it did not contain the shape/size information that would be needed, and I had not resized the images to the required dimensions before storing them. That file came to 2.5 GB, which was okay.
Then I thought I would keep only the image paths in the cloud, so I created another TFRecord file containing only the image paths. Then I realized this might not be an optimal approach, because the TPU would have to open each image individually, resize it to 299x299 and then feed it to the model; it would be better to have the image data available through the .map() function on the TFRecordDataset. So I tried again, this time following this link, by storing the R, G and B values along with the image path inside the TFRecord file.
However, now I see that the size of the TFRecord file is abnormally large, around 40-45 GB, and I ultimately stopped the execution because memory was filling up on the Google Colab TPU.
The original COCO dataset is not that large, roughly 13 GB, and the TFRecord file is being created from only the first 30,000 records, so 40 GB looks like a strange number.
May I know what the problem is with this way of storing features? Is there a better way to store image data in a TFRecord file and then extract it through TFRecordDataset?
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression: they represent data as protobufs so it can be loaded efficiently into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!
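As a rough illustration of why storing decoded R/G/B values inflates the file, here is a minimal sketch, assuming JPEG inputs, with hypothetical feature names and a placeholder GCS path, that keeps only the compressed image bytes in the record and decodes/resizes inside .map():

import tensorflow as tf

def _bytes_feature(value):
    # Wrap a byte string in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_example(writer, image_path):
    # Store the still-compressed JPEG bytes rather than decoded pixel arrays.
    jpeg_bytes = tf.io.read_file(image_path).numpy()
    example = tf.train.Example(features=tf.train.Features(feature={
        "image/encoded": _bytes_feature(jpeg_bytes),
        "image/path": _bytes_feature(image_path.encode("utf-8")),
    }))
    writer.write(example.SerializeToString())

def parse_fn(serialized):
    # Decode and resize on the fly inside the input pipeline.
    features = tf.io.parse_single_example(serialized, {
        "image/encoded": tf.io.FixedLenFeature([], tf.string),
        "image/path": tf.io.FixedLenFeature([], tf.string),
    })
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, (299, 299))
    return image, features["image/path"]

dataset = tf.data.TFRecordDataset("gs://my-bucket/coco-train.tfrecord").map(parse_fn)

Decoded 299x299x3 tensors are much larger than the original JPEGs, which is one likely reason the raw-RGB file grew to 40+ GB.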
I want to train my word2vec models on the HPC cluster provided by my university. However, I have been told that in order to optimize storage on the cluster, I must transform my data into HDF5 and upload that instead. My data consists of txt files (the txt files I want to train word2vec on). How am I supposed to transform txt files into HDF5?
I have been looking through the documentation but cannot seem to find a tool for txt files; should I write my own script?
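One common option, sketched below purely as an assumption that the h5py library is acceptable on the cluster (file names and paths are placeholders), is to store each text file as a variable-length UTF-8 string dataset inside a single HDF5 file:

import glob
import os
import h5py

# Collect the training text files (the path pattern is a placeholder).
txt_files = glob.glob("corpus/*.txt")

with h5py.File("corpus.h5", "w") as f:
    str_dtype = h5py.string_dtype(encoding="utf-8")
    for path in txt_files:
        with open(path, encoding="utf-8") as txt:
            lines = txt.read().splitlines()
        # One variable-length string dataset per source file, keyed by its base name.
        f.create_dataset(os.path.basename(path), data=lines, dtype=str_dtype)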
I have trained two models and generated their detect.tflite files successfully. I need to know: is there any way to merge both detect.tflite files so that the resulting single file can be used in an Android/iOS application?
I did fairly thorough research on this and came to the conclusion that two .tflite files cannot be merged. However, one can combine the datasets, retrain the model and generate a new .tflite file which can do the job of both previous .tflite files.
I'm in a situation where the input to my ML model is a variable number of images per example (but only one label for each set), so I would like to be able to pack multiple images into a single TFRecord example. However, every example I come across online is single image, single label, which is understandable because that's the most common use case. I also wonder about decoding: it appears that tf.image.decode_png only handles one image at a time, but perhaps I can convert all the images to tf.string and use tf.decode_raw, then resize to get all the images?
Thanks
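One way to do this, sketched below with assumed feature names, an assumed 224x224 target size and a recent TF 2.x, is to store all of an example's PNG-encoded images in a single BytesList and decode them one at a time with tf.map_fn when parsing:

import tensorflow as tf

def serialize_example(png_bytes_list, label):
    # png_bytes_list: the already PNG-encoded byte strings for one example.
    feature = {
        "images": tf.train.Feature(bytes_list=tf.train.BytesList(value=png_bytes_list)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def parse_fn(serialized):
    features = tf.io.parse_single_example(serialized, {
        "images": tf.io.VarLenFeature(tf.string),   # variable number of images per example
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    encoded = tf.sparse.to_dense(features["images"], default_value=b"")
    # Decode each PNG and resize to a common shape so the images stack into one tensor.
    images = tf.map_fn(
        lambda b: tf.image.resize(tf.io.decode_png(b, channels=3), (224, 224)),
        encoded,
        fn_output_signature=tf.float32,
    )
    return images, features["label"]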
Why is the TFRecords file sharded in the Inception model example in TensorFlow?
For randomness, couldn't the list of files be shuffled before creating one single TFRecord file?
Why is the TFRecords file sharded in the Inception model example in TensorFlow?
According to the object detection API, there are two advantages to sharding your dataset:
Files can be read in parallel, improving data loading speed
Examples can be shuffled better by sharding
You probably already knew the second point, as it appears in your second question:
For randomness, couldn't the list of files be shuffled before creating one single TFRecord file?
Shuffling the dataset before creating the records is indeed good practice, because a TFRecord file can only be shuffled partially at read time: only a certain number of examples can be loaded into memory (the shuffle buffer), and shuffling is then done by randomly selecting the next example from among the ones in that buffer. You can see more in this question.
However, if you only shuffle the dataset when creating the records, your network will always see the examples in the same order across successive training epochs. This can lead to unwanted convergence behaviour, because the random order was fixed once and for all. It is therefore better to also shuffle the dataset on the fly, so that each epoch sees a different ordering.
Sharding your dataset makes this shuffling easier: instead of being forced to always read in the same order from one single file, you can read a bit from each file, choosing the files at random, as in the sketch below.
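For illustration, here is a minimal sketch (the file pattern, cycle length and buffer sizes are placeholders) of reading the shards in a random order, interleaving records from several files, and shuffling on the fly each epoch:

import tensorflow as tf

# List the shard files and reshuffle the file order every epoch.
files = tf.data.Dataset.list_files("train-*.tfrecord", shuffle=True)

dataset = (
    files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,                      # read a bit from 4 shards at a time
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(buffer_size=10_000)             # in-memory shuffle on top of the shard mixing
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)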