Accessing Cloud Storage from Cloud ML Engine during prediction - tensorflow

I am trying to build an image classifier, that will recognise the class for a test image based on a similiarity measure between the test image and a dataset of labeled images. Basically, I want to use a KNN classifier that takes as input the bottleneck features of a pretrained CNN model.
I would like to store this dataset of labeled images (the bottleneck features) in a seperate bucket in the Google Cloud Storage and give my model access to this dataset during prediction, since the file size of my saved model would be to big, when adding this dataset to the saved model (Google restricts the file size to 250MB). Unfortunately, I can't find a way to access a bucket from a SavedModel. Does anyone have any idea how to solve this?

Code running on the service currently only has access to public GCS buckets. You can contact us offline (cloudml-feedback#google.com) and we may be able to increase your file size quota.

Related

monitor cpu, gpu, memory usage in colab (pro)

I want to track the usage of the above resources while training a model with pyspark. Is there any built in method from colab, I have purchased colab pro, in order to store those numbers in a txt file? The goal is to train a model to predict these values, so we need a big amount of data, so monitoring by the graphs on the right hand side is not an option.
I have also tried using wandb, but couldn't make sense of it, so if someone has a tutorial i would be grateful.

How to train large datasets in Colab free

I have to train 70,000 images for my face verification project on google colab free.
First, it gets stuck on 1st epoch and then even if it starts training, after sometime it throws out of RAM error.
The code I use is:
<https://nbviewer.org/github/nicknochnack/FaceRecognition/blob/main/Facial%20Verification%20with%20a%20Siamese%20Network%20-%20Final.ipynb>
If I've to make mini-batches of my dataset to fit it in the colab's GPU memory, then how can I do it?
Also, I want to train the whole dataset because it contains the images of 5 different people as anchors and positives.
You can do following options to train larger datasets.
Add more pooling layers in model.
Lower input size in your model.
Use Binary Format of images with lower image size for image classification models.
Lower the batch size while training and validating your model.
You can also use tf.data api to do various operations like batching , slicing , processing, shuffling etc to create a data pipeline. You can constrain GPU usage further to avoid Out of memory issues.
Attaching sample colab notebook below. https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb

Migrating legacy ML pipeline to TFX

We are investigating transitioning our ML pipelines from a set of manual steps into a TFX pipeline.
I do however have some questions for which I would like to have some additional insights.
We typically perform the following steps (for an image classification task):
Load image data and meta-data
Filter out ‘bad’ data based on meta-data
Determine image based statistics (classic image processing in Python):
Image level characteristics
Image region characteristics
(region is determined based on a fine-tuned EfficientDet model)
Filter out ‘bad’ data based on image statistics
Generate TFRecords from this image and meta-data
Oversample certain TFRecords for class balancing (using tf.data)
Train an image classifier
…
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
I see two options:
ExampleGen uses a CSV file containing pointers to the image to be loaded and the meta-data to be loaded (above step ‘1’). However:
If this CSV file contains a path to an image file, can ExampleGen then load the image data and add this to its output?
Is the output of ExampleGen a streaming output, or a dump of all example data?
ExampleGen has TFRecords as input (output of above step ‘5’)
-> This implies that we would still need to implement steps 1-5 outside of TFX… Which would decrease the value for TFX for us…
Could you please advice what would be the best way forward?
Can StatisticsGen also generate statistics on a per-example base (for example some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image based characteristics using classic image processing is slow. If new data becomes available, triggering the TFX input component to be executed, ideally already calculated statistics should be loaded from the cache.
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup (normally we do this by oversampling our TFRecords using tf.data)?
If this is done at the ExampleGen level, then the ExampleValidator may still reject some examples potentially unbalancing the data again.
This may not seem like a big issue for large data ML tasks, but it becomes crucial for small data ML tasks (as typically is the case in a healthcare setting).
So I would expect a TFX component for this before the Transform component, but this block should then have access to all data, not in a streaming way (see my earlier question on ExampleGen output)…
Thank you for your insights.
I'll try to address most questions with my experience with tfx.
I have a dataflow job that I run to pre-process my images, labels, features, etc and turn all that into tfrecords. That lives outside of tfx and is ran only when there is data refreshes.
You can do the same, here is a very simple code snippet that i use to resize all my images and create simple features.
try:
image = tf.io.decode_jpeg(image_string)
image = tf.image.resize(image,[image_resize_size,image_resize_size])
image = tf.image.convert_image_dtype(image/255.0, dtype=tf.uint8)
image_shape = image.shape
image = tf.io.encode_jpeg(image,quality=100)
feature = {
'height': _int64_feature(image_shape[0]),
'width' : _int64_feature(image_shape[1]),
'depth' : _int64_feature(image_shape[2]),
'label' : _int64_feature(labels_to_int(element[2].decode())),
'image_raw' : _bytes_feature(image.numpy())
}
tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
except:
print('image could not be decoded')
return None
Once I have the data in tfrecord format, I use the ImportExampleGen component to load the data into my tfx pipeline. This is followed by StatisticsGen which will compute statistics on the features.
When running all of this in the cloud, it is using dataflow under the covers in batch mode.
Your metadata store only caches pipeline metadata, but your data is cached in a gcs bucket and metadata store knows it. So when you re-run your pipeline, if you have caching set to True, your ImportExampleGen, StatisticsGen, SchemaGen, Transform will not be reran if the data hasn't changed. This has huge benefits in time and costs.
ExampleValidator outputs an artifact to let you know what data anomalies are in your data. I created a custom component that intakes the examplevalidator artifact and if my data doesn't meet certain criteria, I kill the pipeline by throwing an error in this component. I wish there was a component that can just stop the pipeline, but I haven't found one, so my work around was to throw an error which stops the pipeline from progressing further.
Usually when I create a tfx pipeline, it is done to automate the machine learning process. At this point we have already done class balancing, feature selection, etc. Since that falls more under the pre-processing stage.
I guess you could technically create a custom component that takes a StatisticsGen artifact, parses it and tries to do some class balancing and creates a new dataset with balances classes. But honestly, I think is better to do it at the preprocessing stage.

Huge size of TF records file to store on Google Cloud

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here COCO dataset is downloaded and first its features are extracted using InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
Withe TPU strategy scope, I created the InceptionV3 model using keras library and loaded model with ImageNet weights as per existing code.
Now, since TPU needs data to be stored on Google Cloud storage, I created a tf records file using tf.Example with the help of this link.
Now, I tried to create this file in several ways so that it will have the data that TPU will find through TFRecordDataset.
At first I directly added image data and image path to the file and uploaded it to GCP bucket but while reading this data, I realized that this image data is not useful as it does not contain shape/size information which it will need and I had not resized it to the required dimension before storage. This file size became 2.5GB which was okay.
Then I thought lets only keep image path at cloud, so I created another tf records file with only image path, then I thought that this may not be an optimized code as TPU will have to open the image individually resize it to 299,299 and then feed to model and it will be better if I have image data through .map() function inside TFRecordDataset, so I again tried, this time by using this link, by storing R, G and B along with image path inside tf records file.
However, now I see that the size of tf records file is abnormally large, like some 40-45GB and ultimately, I stopped the execution as my memory was getting filled up on Google Colab TPU.
The original size of COCO dataset is not that large. It almost like 13GB.. and from that the dataset is being created with only first 30,000 records. so 40GB looks weird number.
May I know what is the problem with this way of feature storage? Is there any better way to store image data in TF records file and then extract through TFRecordDataset.
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression, they represent data as protobufs so it can be optimally loaded into TensorFlow programs.
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!

How to save a tensorflow model trained in google datalab notebook for offline prediction?

I am using Google Cloud Datalab notebook to train my tensorflow model. I want to save the trained model for offline prediction. However, I am clueless on how to save the model. Should I use any tensorflow model saving method or is there any datalab/google cloud storage specific method to do so? Any help in this regard is highly appreciated.
You can use any tensorflow model saving method, but I would suggest that you save it into a Google Cloud Storage bucket and not to local disk. Most tensorflow methods accept Google Cloud Storage paths in place of file names, using the gs:// prefix.
I would suggest using the SavedModelBuilder as it is currently the most portable. There is an example here: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/model.py#L393