I am trying to use TensorFlow-Slim, but I am stuck at the very beginning: how to load the data. The only way I can find on the internet is TFRecord. First, I am wondering whether this format can be used for all types of data (images, signals, etc.), and second, whether there is another way to load data (for example, in Keras we can load it with NumPy and create mini-batches)?
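For the second part of the question, TFRecords are not the only option: the `tf.data` API can consume in-memory NumPy arrays directly, much like the Keras workflow mentioned above. A minimal sketch (the array shapes here are invented for illustration):

```python
import numpy as np
import tensorflow as tf

# Some in-memory data; shapes are made up for this example.
images = np.random.rand(100, 32, 32, 3).astype("float32")
labels = np.random.randint(0, 10, size=(100,))

# tf.data builds shuffled mini-batches directly from NumPy arrays,
# similar to what Keras does with model.fit(x, y, batch_size=...).
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(buffer_size=100)
           .batch(16))

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape)  # (16, 32, 32, 3)
```

TFRecord is still preferred for datasets too large to fit in memory, but nothing forces you to use it for small or medium data.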
I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
There, the COCO dataset is downloaded and its features are first extracted using the InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Within the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights, as in the existing code.
Now, since a TPU needs data to be stored on Google Cloud Storage, I created a TFRecord file using tf.Example with the help of this link.
I tried to create this file in several ways so that it would contain the data that the TPU would read through TFRecordDataset.
At first I added the image data and image path directly to the file and uploaded it to a GCP bucket, but while reading the data back I realized the image data was not useful: it carried no shape/size information, and I had not resized the images to the required dimensions before storing them. That file came to 2.5 GB, which was acceptable.
Then I thought I would keep only the image paths in the cloud, so I created another TFRecord file with just the paths. But that seemed unoptimized, since the TPU would have to open each image individually, resize it to 299x299, and then feed it to the model; it would be better to have the image data available through the .map() function inside TFRecordDataset. So I tried again, this time following this link, storing the R, G and B channels along with the image path inside the TFRecord file.
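For context, storing decoded R/G/B arrays is what blows up the file size: raw pixels are much larger than the compressed JPEG bytes. One common alternative is to write the *encoded* JPEG bytes into the record and decode in `.map()` on the reader side. A sketch (the feature names `image/encoded` and `image/path` are my own choice, not from the original code):

```python
import tensorflow as tf

def _bytes_feature(value):
    """Wrap a bytes value in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_records(image_paths, out_path):
    """Write one tf.Example per image, keeping the JPEG bytes compressed."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for path in image_paths:
            encoded = tf.io.read_file(path).numpy()  # raw JPEG bytes, still compressed
            example = tf.train.Example(features=tf.train.Features(feature={
                'image/encoded': _bytes_feature(encoded),
                'image/path': _bytes_feature(path.encode('utf-8')),
            }))
            writer.write(example.SerializeToString())
```

Decoding and resizing to 299x299 then happens in the input pipeline, so the stored file stays roughly the size of the original JPEGs.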
However, now I see that the size of the TFRecord file is abnormally large, around 40-45 GB, and I ultimately stopped the execution because memory was filling up on the Google Colab TPU.
The original COCO dataset is not that large, only about 13 GB, and the file is being built from just the first 30,000 records, so 40 GB looks like a strange number.
May I know what is wrong with this way of storing features? Is there a better way to store image data in a TFRecord file and then extract it through TFRecordDataset?
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression: they represent data as protobufs so they can be loaded efficiently into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!
What is the main difference between .pb format of tensorflow and .h5 format of keras to store models? Is there any reason to choose one over the other?
Different file formats with different characteristics, both used by TensorFlow to save models (.h5 specifically by Keras).
.pb - protobuf
It is a way to store structured data (in this case a neural network). The project is open source and currently overseen by Google.
Example
person {
name: "John Doe"
email: "jdoe@example.com"
}
A simple message containing two fields; you can load it in one of many supported languages (e.g. C++, Go), parse it, modify it, and send it to someone else in binary format.
Advantages
really small and efficient to parse (compared to, say, .xml), hence often used for data transfer across the web
used by TensorFlow Serving when you want to take your model to production (e.g. inference over the web)
language agnostic - binary format can be read by multiple languages (Java, Python, Objective-C, and C++ among others)
the advised format since TF 2.0; see the official serialization guide
saves various metadata (optimizers, losses, etc., if using a Keras model)
Disadvantages
SavedModel is conceptually harder to grasp than a single file
creates a folder where the weights are stored
Sources
You can read about this format here
.h5 - HDF5 binary data format
Used originally by Keras (now officially part of TensorFlow) to save models. It is less general and more "data-oriented", less programmatic, than .pb.
Advantages
Designed for storing large amounts of data (so large neural networks fit well)
Common file saving format
Everything saved in one file (weights, losses, optimizers used with keras etc.)
Disadvantages
Cannot be used with TensorFlow Serving directly, but you can simply convert it to .pb via keras.experimental.export_saved_model(model, 'path_to_saved_model')
All in all
Use the simpler one (.h5) if you don't need to productionize your model (or it's reasonably far away). Use .pb if you are going for production or just want to standardize on a single format across all TensorFlow-provided tools.
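A minimal sketch of both save paths with a toy model (file names are placeholders; exact defaults vary slightly between TF/Keras versions):

```python
import tensorflow as tf

# A tiny toy model, just to have something to save.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# .h5: architecture, weights and optimizer state in one HDF5 file.
model.save('model.h5')

# SavedModel: a directory containing saved_model.pb plus a variables/ folder,
# which is the layout TensorFlow Serving consumes.
tf.saved_model.save(model, 'saved_model_dir')

# The single-file format round-trips through load_model.
reloaded = tf.keras.models.load_model('model.h5')
```

The .h5 file is convenient to move around, while the SavedModel directory is what you point Serving or other TF tooling at.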
I know how to load pre-trained image models from TensorFlow Hub, like so:
#load model
image_module = hub.Module('https://tfhub.dev/google/imagenet/mobilenet_v2_035_128/feature_vector/2')
#get predictions
features = image_module(batch_images)
I also know how to customize the output of this model (fine-tune it on a new dataset). The existing modules expect the input batch_images to be an RGB image tensor.
My question: instead of the input being an RGB image of certain dimensions, I would like to use a tensor (of shape 20x20x128, from a different model) as input to the Hub model. This means I need to bypass the initial layers of the TF-Hub model definition (I don't need them). Is this possible with the TF-Hub module APIs? The documentation is not clear on this aspect.
P.S.: I can do this easily by defining my own layers, but I am trying to see whether I can use the TF-Hub APIs.
The existing https://tfhub.dev/google/imagenet/... modules do not support this.
Generally speaking, the hub.Module format allows multiple signatures (that is, combinations of input/output tensors; think feeds and fetches as in tf.Session.run()). So module publishers can arrange for that if there is a common usage pattern they want to support.
But for free-form experimentation at this level of sophistication, you are probably better off directly using and tweaking the code that defines the models, such as TF Slim (for TF1.x) or Keras Applications (also for TF2). Both provide Imagenet-pretrained checkpoints for downloading and restoring on the side.
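As a toy illustration of the "tweak the model code" route: with a Sequential model you can build a new model out of the tail layers and feed it your own feature tensor directly. (The layer names and shapes below are invented; InceptionV3 and MobileNet have branching graphs, so for those you would edit the Keras Applications source rather than slicing layers like this.)

```python
import tensorflow as tf

# A toy "full" model standing in for a pretrained network.
full = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),   # early layers we want to skip
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),           # tail layers we want to keep
    tf.keras.layers.Dense(10),
])

# Reuse the tail layers in a new model (weights are shared, not copied).
tail = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 20, 128)),  # shape of the intermediate tensor to feed
    *full.layers[2:],
])

features = tf.random.normal([4, 20, 20, 128])  # e.g. the output of another model
print(tail(features).shape)  # (4, 10)
```

This only works cleanly when the skipped prefix is a linear chain; for branched architectures, copying and trimming the model-building code is the more reliable approach the answer suggests.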
Does anyone know of a high-level API in Java for directly reading TFRecords and feeding them to a TensorFlow SavedModel? The Python API allows both Example protos (TFRecords) and tensors to be fed to a TF model for inference. The only API I have seen in Java creates raw tensors; is there a way, similar to the Python SDK, to feed TFRecords (Example protos) directly to a SavedModel bundle in Java as well?
I just came across the same scenario, and I used TFRecordIO from Java Apache Beam to read records. For example:
pipeline
.apply(TFRecordIO.read().from(dataPath))
.apply(ParDo.of(new ModelEvaluationFn()));
Inside ModelEvaluationFn I do the scoring using the SavedModel. With Java Apache Beam you can run locally, on GCP Dataflow, Spark, Flink, etc. If you are using Spark directly, there's spark-tensorflow-connector.
Another thing I came across is how to parse the TFRecords in Java, because I needed to get the label value and group by some column values to get breakdown scores. The org.tensorflow/proto package can help you do that. Here are examples: example1, example2. Essentially, it's Example.parseFrom(byte[]).
I would like to get all the scalar summary data in Tensorflow summary.
The TensorBoard web GUI only allows me to download 1,000 records of the summary. Is there any way I can get every single scalar data point that I've put in the summary?
If you have access to the event files, you can load them in Python as an alternative to downloading via TensorBoard. See How to export tensorboard data? for how to do this.
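One way to do the loading is TensorBoard's own EventAccumulator; a sketch (the helper name is mine, and disabling size guidance is what lifts the downsampling that causes the 1,000-point limit in the GUI):

```python
from tensorboard.backend.event_processing import event_accumulator

def load_all_scalars(logdir):
    """Return {tag: [(step, value), ...]} with every scalar point, undownsampled.

    size_guidance={SCALARS: 0} tells the accumulator to keep all points
    instead of applying TensorBoard's default downsampling.
    """
    ea = event_accumulator.EventAccumulator(
        logdir,
        size_guidance={event_accumulator.SCALARS: 0},
    )
    ea.Reload()  # actually read the event files from disk
    return {tag: [(e.step, e.value) for e in ea.Scalars(tag)]
            for tag in ea.Tags()['scalars']}
```

Point it at the same log directory you pass to TensorBoard and you get the complete series for every scalar tag.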