Huge size of TF records file to store on Google Cloud - tensorflow

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here COCO dataset is downloaded and first its features are extracted using InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Within the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights, as in the existing code.
Now, since TPU needs data to be stored on Google Cloud storage, I created a tf records file using tf.Example with the help of this link.
Now, I tried to create this file in several ways so that it will have the data that TPU will find through TFRecordDataset.
At first I directly added the image data and image path to the file and uploaded it to a GCP bucket. While reading the data back, I realized that it was not usable: it did not carry the shape/size information needed to decode it, and I had not resized the images to the required dimensions before storing them. This file was 2.5 GB, which was okay.
Then I thought of keeping only the image path in the cloud, so I created another TFRecords file containing only the image path. I then realized this might not be efficient, since the TPU would have to open each image individually, resize it to 299x299, and then feed it to the model. It would be better to have the image data available through a .map() function on the TFRecordDataset, so I tried again, this time following this link and storing R, G and B along with the image path inside the TFRecords file.
However, the resulting TFRecords file is abnormally large, around 40-45 GB, and I eventually stopped the execution because memory was filling up on the Google Colab TPU runtime.
The original COCO dataset is not that large, about 13 GB, and the dataset is being created from only the first 30,000 records, so 40 GB looks wrong.
What is the problem with this way of storing features? Is there a better way to store image data in a TFRecords file and then read it back through TFRecordDataset?

I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression, they represent data as protobufs so it can be optimally loaded into TensorFlow programs.
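To see why storing decoded pixels inflates the file, a rough back-of-the-envelope calculation may help. This is only a sketch: it assumes the R, G and B values were written as one float per pixel value (4 bytes each in a protobuf float list) for images already resized to 299x299, which the question does not state explicitly, and it ignores per-record overhead.

```python
# Rough size estimate for raw pixel storage (assumed encoding, see lead-in).
HEIGHT, WIDTH, CHANNELS = 299, 299, 3
NUM_IMAGES = 30_000

values_per_image = HEIGHT * WIDTH * CHANNELS  # one value per pixel per channel


def total_gb(bytes_per_value):
    """Total payload in decimal gigabytes for NUM_IMAGES raw images."""
    return values_per_image * bytes_per_value * NUM_IMAGES / 1e9


print(f"4 bytes per value (float32): {total_gb(4):.1f} GB")
print(f"1 byte per value (uint8):    {total_gb(1):.1f} GB")
```

Under these assumptions the float case lands in the tens of gigabytes, which is the same order of magnitude as the file size reported in the question, while the original JPEGs stay small because they are compressed.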
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!
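As a minimal sketch of one alternative (not the script linked above), the still-compressed JPEG bytes can be stored in each record and decoded and resized inside .map() when reading. The feature names `image_raw` and `path`, the helper names, and the synthetic demo image are assumptions made for illustration; `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or later.

```python
import os
import tempfile

import tensorflow as tf


def make_example(jpeg_bytes, path):
    """Build a tf.train.Example holding compressed JPEG bytes and the image path."""
    feature = {
        "image_raw": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "path": tf.train.Feature(bytes_list=tf.train.BytesList(value=[path.encode("utf-8")])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse_example(serialized):
    """Decode the JPEG and resize to the 299x299 input expected by InceptionV3."""
    spec = {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    image = tf.io.decode_jpeg(parsed["image_raw"], channels=3)
    image = tf.image.resize(image, [299, 299])
    image = tf.keras.applications.inception_v3.preprocess_input(image)
    return image, parsed["path"]


# Synthetic stand-in for a real COCO image, so the sketch runs on its own.
pixels = tf.cast(tf.random.uniform([64, 48, 3], maxval=256, dtype=tf.int32), tf.uint8)
jpeg_bytes = tf.io.encode_jpeg(pixels).numpy()

record_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(record_path) as writer:
    writer.write(make_example(jpeg_bytes, "demo.jpg").SerializeToString())

dataset = (
    tf.data.TFRecordDataset(record_path)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(1)
)
images, paths = next(iter(dataset))
print(images.shape, paths.numpy())
```

For a TPU, the record file would live in a GCS bucket and the path passed to TFRecordDataset would be a gs:// URL instead of a local temporary file.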

Related

CNTK - Faster RCNN Train with My Own Labels Data Set Can not Train More Than 20 Images

I'm working with CNTK Faster R-CNN object detection and have run into a problem.
To explain the problem, I will describe my workflow from the start.
First I followed https://learn.microsoft.com/en-us/cognitive-toolkit/object-detection-using-faster-r-cnn
to install all the required packages. That step succeeded. Then I tried the grocery data set, which contains 20 training images (I'm using AlexNet as the base model).
Training completed, and everything appeared to work at this point.
Then I used VoTT to label my own data set and put it into the CNTK data set folder. I also used annotations_helper.py to generate the other input files needed for model training.
After creating My_DataSet_config.py and changing some configuration, I found that I cannot train on more than 20 images. For example, if I train on 30 images, the program fails with an error like "gt_boxes is empty" (it really is empty, but with some specific numbers of training images it is no longer empty).
I tried to follow advice I found on GitHub, which says the problem lies in the image and annotation files and suggests deleting the offending image and running again.
I did that, but it did not solve my case. Whenever the number of training images is not 20, the error comes back on some other image. Please take a look. Thank you.
Python 3.5
Windows
CNTK 2.7
Here is my data set configuration file.
(image: data set configuration file)
Here is my model configuration file.
(image: model configuration file)

Migrating legacy ML pipeline to TFX

We are investigating transitioning our ML pipelines from a set of manual steps into a TFX pipeline.
I do however have some questions for which I would like to have some additional insights.
We typically perform the following steps (for an image classification task):
1. Load image data and meta-data
2. Filter out ‘bad’ data based on meta-data
3. Determine image based statistics (classic image processing in Python):
   - Image level characteristics
   - Image region characteristics
     (region is determined based on a fine-tuned EfficientDet model)
4. Filter out ‘bad’ data based on image statistics
5. Generate TFRecords from this image and meta-data
6. Oversample certain TFRecords for class balancing (using tf.data)
7. Train an image classifier
8. …
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
I see two options:
ExampleGen uses a CSV file containing pointers to the image to be loaded and the meta-data to be loaded (above step ‘1’). However:
If this CSV file contains a path to an image file, can ExampleGen then load the image data and add this to its output?
Is the output of ExampleGen a streaming output, or a dump of all example data?
ExampleGen has TFRecords as input (output of above step ‘5’)
-> This implies that we would still need to implement steps 1-5 outside of TFX… Which would decrease the value for TFX for us…
Could you please advise what would be the best way forward?
Can StatisticsGen also generate statistics on a per-example base (for example some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image based characteristics using classic image processing is slow. If new data becomes available, triggering the TFX input component to be executed, ideally already calculated statistics should be loaded from the cache.
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup (normally we do this by oversampling our TFRecords using tf.data)?
If this is done at the ExampleGen level, then the ExampleValidator may still reject some examples potentially unbalancing the data again.
This may not seem like a big issue for large data ML tasks, but it becomes crucial for small data ML tasks (as typically is the case in a healthcare setting).
So I would expect a TFX component for this before the Transform component, but this block should then have access to all data, not in a streaming way (see my earlier question on ExampleGen output)…
Thank you for your insights.
I'll try to address most questions based on my experience with TFX.
I have a Dataflow job that I run to pre-process my images, labels, features, etc. and turn all of that into TFRecords. It lives outside of TFX and is run only when the data is refreshed.
You can do the same. Here is a very simple code snippet that I use to resize all my images and create simple features.
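    import tensorflow as tf

    # The wrapper name and signature are assumed; the original snippet is the body
    # of a per-element function. _int64_feature, _bytes_feature and labels_to_int
    # are helpers defined elsewhere in the job.
    def make_tf_example(element, image_string, image_resize_size):
        try:
            # Decode, resize to a square, and convert back to uint8 for re-encoding.
            image = tf.io.decode_jpeg(image_string)
            image = tf.image.resize(image, [image_resize_size, image_resize_size])
            image = tf.image.convert_image_dtype(image / 255.0, dtype=tf.uint8)
            image_shape = image.shape
            # Store compressed JPEG bytes, not raw pixels, to keep records small.
            image = tf.io.encode_jpeg(image, quality=100)
            feature = {
                'height': _int64_feature(image_shape[0]),
                'width': _int64_feature(image_shape[1]),
                'depth': _int64_feature(image_shape[2]),
                'label': _int64_feature(labels_to_int(element[2].decode())),
                'image_raw': _bytes_feature(image.numpy()),
            }
            tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
            return tf_example  # assumed: the original fragment ends before this point
        except Exception:
            # Skip images that cannot be decoded instead of failing the whole job.
            print('image could not be decoded')
            return None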
Once I have the data in tfrecord format, I use the ImportExampleGen component to load the data into my tfx pipeline. This is followed by StatisticsGen which will compute statistics on the features.
When running all of this in the cloud, it is using dataflow under the covers in batch mode.
Your metadata store only caches pipeline metadata, but your data is cached in a GCS bucket and the metadata store knows where it is. So when you re-run your pipeline with caching set to True, your ImportExampleGen, StatisticsGen, SchemaGen and Transform components will not be rerun if the data hasn't changed. This has huge benefits in time and cost.
ExampleValidator outputs an artifact that tells you what anomalies are in your data. I created a custom component that takes in the ExampleValidator artifact, and if my data doesn't meet certain criteria, I kill the pipeline by throwing an error in this component. I wish there was a component that could just stop the pipeline, but I haven't found one, so my workaround is to throw an error, which stops the pipeline from progressing further.
Usually when I create a TFX pipeline, it is to automate the machine learning process. At that point we have already done class balancing, feature selection, etc., since that falls more under the pre-processing stage.
I guess you could technically create a custom component that takes a StatisticsGen artifact, parses it, tries to do some class balancing, and creates a new dataset with balanced classes. But honestly, I think it is better to do this at the pre-processing stage.
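As a rough illustration of balancing at the pre-processing stage, the sketch below oversamples minority classes by index before any records are written. It is plain Python and not tied to TFX; the function name and the strategy of matching the largest class are assumptions, and it needs all labels in memory, which is usually fine for small datasets.

```python
import random
from collections import Counter, defaultdict


def oversample_indices(labels, seed=0):
    """Return example indices in which every class appears equally often.

    Minority classes are topped up by sampling their indices with replacement
    until they match the size of the largest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    rng.shuffle(balanced)
    return balanced


labels = ["healthy"] * 5 + ["disease"] * 2
balanced = oversample_indices(labels)
print(Counter(labels[i] for i in balanced))
```

The returned indices can then drive whichever job writes the TFRecords, so that the balanced set is what ImportExampleGen later reads.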

Accessing Cloud Storage from Cloud ML Engine during prediction

I am trying to build an image classifier that will recognise the class of a test image based on a similarity measure between the test image and a dataset of labeled images. Basically, I want to use a KNN classifier that takes as input the bottleneck features of a pretrained CNN model.
I would like to store this dataset of labeled images (the bottleneck features) in a separate bucket in Google Cloud Storage and give my model access to it during prediction, since my saved model would be too big if I added this dataset to it (Google restricts the file size to 250 MB). Unfortunately, I can't find a way to access a bucket from a SavedModel. Does anyone have any idea how to solve this?
Code running on the service currently only has access to public GCS buckets. You can contact us offline (cloudml-feedback@google.com) and we may be able to increase your file size quota.

Specifically, how to train neural network when it is larger than ram?

I have specific questions about how to train a neural network that is larger than ram. I want to use the de facto standard which appears to be Keras and tensorflow.
What are the key classes and methods that I need to use, from NumPy to SciPy, pandas, h5py, and Keras, in order not to exceed my meager 8 GB of RAM? I have time to train the model; I don't have cash. My dataset requires 200 GB of RAM.
In Keras there is a model.fit() method. It requires X and Y NumPy arrays. How do I get it to accept HDF5-backed arrays on disk? And when specifying the model architecture itself, how do I save RAM, given that the working memory may require more than 8 GB at times?
Regarding fit_generator, does it accept HDF5 files? If model.fit() can accept HDF5, do I even need fit_generator? It seems that even with these methods you still need to fit the entire model in RAM?
In keras does the model include the training data when calculating its memory requirements? If so I am in trouble I think.
In essence I am under the assumption that at no time can I exceed my 8 Gb of ram, whether from one hot encoding to loading the model to training on even a small batch of samples. I am just not sure how to accomplish this concretely.
I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8GB problem too.
I can only suggest how to pass little batches at a time.
Question 2:
I don't think Keras supports passing the h5py file directly (but I really don't know), but you can create a loop to load the file partially (if the file is saved in a way that allows that).
You can create an outer loop to:
create a little array with only one or two samples from the file
use the method train_on_batch passing only that little array.
release the memory disposing of the array or filling this same array with the next sample(s).
Question 3:
I also don't know about the h5py file. Is the object that opens the file a Python generator?
If not, you can create the generator yourself.
The idea is to make the generator load only part of the file and yield little batch arrays with one or two data samples. (Pretty much the same as in question 2, but the loop goes inside a generator.)
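A minimal sketch of such a generator is below. It works on anything that supports len() and slicing, which as far as I know includes h5py datasets; the demo uses plain lists so it runs without any file. The function name and the dataset keys "X" and "Y" in the commented usage are made up for illustration.

```python
def batch_generator(features, targets, batch_size):
    """Yield (features, targets) slices so only one batch is in memory at a time."""
    num_samples = len(features)
    for start in range(0, num_samples, batch_size):
        stop = min(start + batch_size, num_samples)
        # Slicing an on-disk dataset reads just this range into memory.
        yield features[start:stop], targets[start:stop]


# Stand-in data; with h5py the same call would receive on-disk datasets.
demo_x = list(range(10))
demo_y = [value * 2 for value in demo_x]
batches = list(batch_generator(demo_x, demo_y, batch_size=4))
print(len(batches))

# Hypothetical usage with an HDF5 file and a compiled Keras model:
#   with h5py.File("data.h5", "r") as f:
#       for x_batch, y_batch in batch_generator(f["X"], f["Y"], 32):
#           model.train_on_batch(x_batch, y_batch)
```

Each pass over the generator is one epoch, so for several epochs the outer loop just creates the generator again.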
Usually for very large sample sets an "online" training method is used. This means that instead of training your neural network in one go with a large batch, it allows the neural network to be updated incrementally as more samples are obtained. See: Stochastic Gradient Descent
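To make the incremental idea concrete, here is a toy sketch in plain Python: a one-variable linear model is updated one sample at a time from a stream, so the full dataset never has to sit in memory. The synthetic stream, learning rate and function names are assumptions chosen for the demo, not a recommendation for real training.

```python
def sgd_step(weight, bias, x, y, learning_rate):
    """One stochastic gradient descent update for squared error on one sample."""
    error = (weight * x + bias) - y
    weight -= learning_rate * error * x
    bias -= learning_rate * error
    return weight, bias


def stream_samples(num_samples):
    """Yield (x, y) pairs one at a time, standing in for data read from disk."""
    for i in range(num_samples):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0  # underlying relation: y = 2x + 1


weight, bias = 0.0, 0.0
for x, y in stream_samples(20000):
    weight, bias = sgd_step(weight, bias, x, y, learning_rate=0.1)

print(weight, bias)
```

The learned weight and bias should approach the slope and intercept of the underlying relation. Keras does the same thing per batch when you call train_on_batch or train from a generator.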

mnist and cifar10 examples with TFRecord train/test file

I am a new user of Tensorflow. I would like to use it for training a dataset of 2M images. I did this experiment in caffe using lmdb file format.
After reading Tensorflow related posts, I realized that TFRecord is the most suitable file format to do so. Therefore, I am looking for complete CNN examples which use TFRecord data. I noticed that the image related tutorials (mnist and cifar10 in link1 and link2) are provided with a different binary file format where the entire data-set is loaded at once. Therefore, I would like to know if anyone knows if these tutorials (mnist and cifar10) are available using TFRecord data (for both CPU and GPU).
I assume that you want to both write and read TFRecord files. I think what is done here (reading_data.py) should help you convert MNIST data into TFRecords.
For reading it back, this script does the trick: fully_connected_reader.py
This could be done similarly with cifar10.