Training from remote resources - tensorflow

All,
I've researched this some and haven't found a clear answer anywhere.
Using Keras with the TF backend, how can you train a model using assets (images, for example) that are not local but remote?
For example, if you have 1M images on S3 that are labeled but not organized by folder, is there a practical way to stream the data in a form Keras can use to train a model?
My thinking is that I would supply a file that was of the format:
{ label: "Apple", img: http://someurl/img.jpg }
{ label: "Banana", img: http://someurl/img.jpg }
{ label: "Orange", img: http://someurl/img.jpg }
You could use preprocessing.load_img or Pillow to fetch the image from each URL and resize it.
This question is more about the correct process for this and whether it is feasible.

This would be possible by mirroring Keras' generator API. You can make a standard Python generator that has an index of image URLs and yields batches of images loaded from those URLs.
However, I would not recommend this approach. Loading images from the web introduces extra latency, which can significantly slow down your model's training. The only case where it might be a good idea is if you literally do not have the space on your SSD to store the whole dataset, and/or you find that the time it takes to load a batch of images is short compared to the time it takes to train on that batch.
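For reference, here is a minimal sketch of that generator approach using keras.utils.Sequence. The manifest file name, class list, target size, and batch size are just placeholders, not a definitive implementation.

    # A minimal sketch, assuming a JSON-lines manifest like the one above;
    # names and sizes are placeholders.
    import json
    from io import BytesIO

    import numpy as np
    import requests
    from PIL import Image
    from tensorflow import keras

    class RemoteImageSequence(keras.utils.Sequence):
        def __init__(self, manifest_path, class_names, batch_size=32, target_size=(224, 224)):
            # One JSON object per line: {"label": "Apple", "img": "http://someurl/img.jpg"}
            with open(manifest_path) as f:
                self.records = [json.loads(line) for line in f]
            self.class_to_idx = {c: i for i, c in enumerate(class_names)}
            self.batch_size = batch_size
            self.target_size = target_size

        def __len__(self):
            return int(np.ceil(len(self.records) / self.batch_size))

        def __getitem__(self, idx):
            batch = self.records[idx * self.batch_size:(idx + 1) * self.batch_size]
            images, labels = [], []
            for rec in batch:
                resp = requests.get(rec["img"], timeout=10)  # network latency is paid here
                img = Image.open(BytesIO(resp.content)).convert("RGB").resize(self.target_size)
                images.append(np.asarray(img, dtype=np.float32) / 255.0)
                labels.append(self.class_to_idx[rec["label"]])
            return np.stack(images), keras.utils.to_categorical(labels, len(self.class_to_idx))

    # model.fit(RemoteImageSequence("manifest.jsonl", ["Apple", "Banana", "Orange"]))

You would pass an instance of this Sequence directly to model.fit; prefetching or thread pools could hide some of the download latency, but the basic trade-off described above remains.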

Related

Migrating legacy ML pipeline to TFX

We are investigating transitioning our ML pipelines from a set of manual steps into a TFX pipeline.
However, I have some questions for which I would like some additional insight.
We typically perform the following steps (for an image classification task):
Load image data and meta-data
Filter out ‘bad’ data based on meta-data
Determine image based statistics (classic image processing in Python):
Image level characteristics
Image region characteristics
(region is determined based on a fine-tuned EfficientDet model)
Filter out ‘bad’ data based on image statistics
Generate TFRecords from this image and meta-data
Oversample certain TFRecords for class balancing (using tf.data)
Train an image classifier
…
Now, I’m trying to map this onto the typical example TFX pipeline.
This however raises a number of questions:
I see two options:
ExampleGen uses a CSV file containing pointers to the image to be loaded and the meta-data to be loaded (above step ‘1’). However:
If this CSV file contains a path to an image file, can ExampleGen then load the image data and add this to its output?
Is the output of ExampleGen a streaming output, or a dump of all example data?
ExampleGen has TFRecords as input (output of above step ‘5’)
-> This implies that we would still need to implement steps 1-5 outside of TFX, which would decrease the value of TFX for us.
Could you please advise on the best way forward?
Can StatisticsGen also generate statistics on a per-example basis (for example, some image (region) characteristics based on classic image processing)? Or should this be implemented in ExampleGen? Or…?
Can the calculated statistics be cached using the metadata store? If yes, is there an example of this available?
Calculating image-based characteristics using classic image processing is slow. If new data becomes available and triggers the TFX input component to execute, the statistics that have already been calculated should ideally be loaded from the cache.
Is it correct that ExampleValidator may reject some examples (e.g. missing data, outliers, …)?
How can class balancing at the network input side (not via the loss function) be achieved in this setup (normally we do this by oversampling our TFRecords using tf.data)?
If this is done at the ExampleGen level, then the ExampleValidator may still reject some examples potentially unbalancing the data again.
This may not seem like a big issue for large data ML tasks, but it becomes crucial for small data ML tasks (as typically is the case in a healthcare setting).
So I would expect a TFX component for this before the Transform component, but this block should then have access to all data, not in a streaming way (see my earlier question on ExampleGen output)…
Thank you for your insights.
I'll try to address most of your questions based on my experience with TFX.
I have a Dataflow job that I run to pre-process my images, labels, features, etc. and turn all of that into TFRecords. That lives outside of TFX and is run only when there are data refreshes.
You can do the same; here is a very simple code snippet that I use to resize all my images and create simple features.
    # Assumes `import tensorflow as tf`; _int64_feature/_bytes_feature are the usual
    # tf.train.Feature helpers and labels_to_int maps a label string to a class id.
    try:
        image = tf.io.decode_jpeg(image_string)
        image = tf.image.resize(image, [image_resize_size, image_resize_size])
        image = tf.image.convert_image_dtype(image / 255.0, dtype=tf.uint8)
        image_shape = image.shape
        image = tf.io.encode_jpeg(image, quality=100)
        feature = {
            'height': _int64_feature(image_shape[0]),
            'width': _int64_feature(image_shape[1]),
            'depth': _int64_feature(image_shape[2]),
            'label': _int64_feature(labels_to_int(element[2].decode())),
            'image_raw': _bytes_feature(image.numpy())
        }
        tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
    except Exception:
        print('image could not be decoded')
        return None
Once I have the data in tfrecord format, I use the ImportExampleGen component to load the data into my tfx pipeline. This is followed by StatisticsGen which will compute statistics on the features.
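For reference, the wiring for that part of the pipeline is only a few lines; the input path below is a placeholder and the constructor arguments can differ slightly between TFX versions.

    # A minimal sketch of the ImportExampleGen -> StatisticsGen wiring described above;
    # 'gs://my-bucket/tfrecords' is a placeholder path.
    from tfx.components import ImportExampleGen, StatisticsGen, SchemaGen

    example_gen = ImportExampleGen(input_base='gs://my-bucket/tfrecords')
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])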
When running all of this in the cloud, it uses Dataflow under the covers in batch mode.
Your metadata store only caches pipeline metadata, but your data is cached in a GCS bucket and the metadata store knows about it. So when you re-run your pipeline with caching set to True, your ImportExampleGen, StatisticsGen, SchemaGen and Transform will not be rerun if the data hasn't changed. This has huge benefits in time and cost.
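Caching itself is just a flag when you construct the pipeline; a rough sketch, reusing the components from the snippet above (names and paths are placeholders):

    # Sketch: enabling component-level caching; pipeline_root and the metadata
    # path are placeholders.
    from tfx.orchestration import metadata, pipeline

    p = pipeline.Pipeline(
        pipeline_name='image_classifier',
        pipeline_root='gs://my-bucket/pipeline_root',
        components=[example_gen, statistics_gen, schema_gen],
        enable_cache=True,  # unchanged inputs -> components are not rerun
        metadata_connection_config=metadata.sqlite_metadata_connection_config('metadata.db'),
    )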
ExampleValidator outputs an artifact to let you know what anomalies are in your data. I created a custom component that takes the ExampleValidator artifact and, if my data doesn't meet certain criteria, I kill the pipeline by throwing an error in this component. I wish there were a component that could simply stop the pipeline, but I haven't found one, so my workaround is to throw an error, which stops the pipeline from progressing further.
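If it helps, that "throw an error to stop the pipeline" workaround can be written as a small Python-function custom component, roughly like the one below. The file layout under the ExampleValidator artifact URI varies by TFX version, so treat the path as an assumption.

    # Rough sketch of a gate component that fails when ExampleValidator reports anomalies.
    # The 'Split-train/SchemaDiff.pb' layout is an assumption for recent TFX releases.
    import os

    import tensorflow as tf
    from tensorflow_metadata.proto.v0 import anomalies_pb2
    from tfx.dsl.component.experimental.annotations import InputArtifact
    from tfx.dsl.component.experimental.decorators import component
    from tfx.types.standard_artifacts import ExampleAnomalies

    @component
    def AnomalyGate(anomalies: InputArtifact[ExampleAnomalies]):
        path = os.path.join(anomalies.uri, 'Split-train', 'SchemaDiff.pb')
        result = anomalies_pb2.Anomalies()
        with tf.io.gfile.GFile(path, 'rb') as f:
            result.ParseFromString(f.read())
        if result.anomaly_info:
            # Raising fails this component, which keeps downstream components from running.
            raise RuntimeError('Data anomalies found: %s' % list(result.anomaly_info.keys()))

    # usage sketch: anomaly_gate = AnomalyGate(anomalies=example_validator.outputs['anomalies'])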
Usually when I create a TFX pipeline, it is done to automate the machine learning process. At that point we have already done class balancing, feature selection, etc., since that falls more under the pre-processing stage.
I guess you could technically create a custom component that takes a StatisticsGen artifact, parses it, does some class balancing, and creates a new dataset with balanced classes. But honestly, I think it is better to do it at the preprocessing stage.
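To make the preprocessing-stage option concrete, oversampling with tf.data usually looks roughly like this; the per-class file patterns, weights, and sizes are placeholders.

    # Sketch of oversampling per-class TFRecords with tf.data; on older TF versions
    # use tf.data.experimental.sample_from_datasets instead.
    import tensorflow as tf

    class_a_files = ['gs://my-bucket/tfrecords/class_a-*']   # placeholder patterns
    class_b_files = ['gs://my-bucket/tfrecords/class_b-*']

    per_class = [
        tf.data.TFRecordDataset(tf.data.Dataset.list_files(files)).repeat()
        for files in (class_a_files, class_b_files)
    ]
    balanced = (tf.data.Dataset.sample_from_datasets(per_class, weights=[0.5, 0.5])
                .take(10000)   # examples per epoch, placeholder
                .batch(32))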

Huge size of TF records file to store on Google Cloud

I am trying to modify a tensorflow project so that it becomes compatible with TPU.
For this, I started with the code explained on this site.
Here the COCO dataset is downloaded and its features are first extracted using the InceptionV3 model.
I wanted to modify this code so that it supports TPU.
For this, I added the mandatory code for TPU as per this link.
With the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights as in the existing code.
Now, since the TPU needs data to be stored on Google Cloud Storage, I created a TFRecords file using tf.Example with the help of this link.
I tried to create this file in several ways so that it would contain the data that the TPU will read through TFRecordDataset.
At first I directly added the image data and image path to the file and uploaded it to the GCP bucket, but while reading this data back I realized that the image data was not useful, since it did not contain the shape/size information that will be needed, and I had not resized it to the required dimensions before storage. This file was 2.5 GB, which was okay.
Then I thought I would keep only the image path in the cloud, so I created another TFRecords file with just the image path. But then I figured this might not be optimal, since the TPU would have to open each image individually, resize it to 299x299, and then feed it to the model; it would be better to have the image data available through the .map() function inside TFRecordDataset. So I tried again, this time using this link, storing the R, G and B channels along with the image path inside the TFRecords file.
However, now I see that the size of the TFRecords file is abnormally large, around 40-45 GB, and I ultimately stopped the execution as my memory was filling up on the Google Colab TPU.
The original COCO dataset is not that large, roughly 13 GB, and the dataset is being created from only the first 30,000 records, so 40 GB looks like a weird number.
May I know what the problem is with this way of storing features? Is there a better way to store image data in a TFRecords file and then extract it through TFRecordDataset?
I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression: they represent data as protobufs so they can be loaded optimally into TensorFlow programs.
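To illustrate why storing decoded R/G/B arrays blows up the size: if you keep the already-encoded JPEG bytes (plus shape and label), each record stays close to the size of the source file, and the decode and 299x299 resize can happen inside the dataset's .map() at training time. A rough sketch (the function name and label argument are placeholders):

    # Sketch: store the encoded JPEG bytes rather than decoded pixel arrays.
    import tensorflow as tf

    def image_example(jpeg_path, label):
        image_bytes = tf.io.read_file(jpeg_path)            # raw, still-compressed JPEG
        shape = tf.io.extract_jpeg_shape(image_bytes)       # [height, width, channels]
        feature = {
            'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes.numpy()])),
            'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(shape[0])])),
            'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(shape[1])])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))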
You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.
Furthermore, we have implemented detection models for COCO using TF2/Keras optimized for GPU/TPU here which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!

Is there any way to modify available TensorFlow models architecture (such as ssd or fast r-cnn) so it is optimized for only one object detection?

I am new to machine learning and TensorFlow, so I'm sorry in advance; please correct me if my understanding is wrong. I have a project developing real-time traffic-light detection with TensorFlow.
I've been working with pre-trained TensorFlow models such as SSD MobileNet and Faster R-CNN ResNet. However, the expected accuracy has not yet been reached. I have already considered adding more data to the dataset (my dataset contains roughly 1,000 images), but since adding data is more work (I would have to do another round of data collection and label all the images), which could take days, I want to consider another option.
Is there any way to modify the TensorFlow model architectures so I could optimize them and make them focus only on traffic light detection? I've been looking through the TensorFlow models folder and could not find the files in which these model architectures are defined.
Any help will be appreciated. Thank you
What you need is to fine-tune a pretrained model on your traffic light dataset. Given that you already have about 1000 images for just one class, this is a decent dataset.
In order to perform fine-tuning, there are some important steps to do.
First you need to transform your data into tfrecord format. Follow this tutorial to generate tfrecord files. This is actually a difficult step.
Create a label_map.pbtxt for your model; since it is only traffic lights, what you need is the following. Here is a sample label_map file:
item {
name: "traffic-light"
id: 1
display_name: "traffic-light"
}
Then you need to prepare a pipeline config file for the model. Since you want a real-time detector, I suggest you use the SSD-MobileNet models. Some sample config files are available here. You can take one of these sample configs and modify some fields to get the pipeline config for your model.
Suppose you choose SSD-MobileNet; then you can modify this config file, ssd_mobilenet_v2_coco.config. Specifically, you need to modify the following fields (a trimmed example follows this list):
num_classes: you need to change from 90 to 1 since you only have traffic lights to detect.
input_path: (in both train_input_reader and eval_input_reader), point this to the tfrecord file you created.
label_map_path: (in both train_input_reader and eval_input_reader), point this to the label_map file you created.
fine_tune_checkpoint: set this path to a downloaded pretrained ssd-mobilenet model.
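Roughly, the edited parts of ssd_mobilenet_v2_coco.config end up looking like this (heavily trimmed, with placeholder paths for your own files):

    model {
      ssd {
        num_classes: 1
        # ... rest of the model section unchanged ...
      }
    }
    train_config {
      fine_tune_checkpoint: "pretrained/ssd_mobilenet_v2_coco/model.ckpt"
      # ...
    }
    train_input_reader {
      tf_record_input_reader {
        input_path: "data/train.record"
      }
      label_map_path: "data/label_map.pbtxt"
    }
    eval_input_reader {
      tf_record_input_reader {
        input_path: "data/eval.record"
      }
      label_map_path: "data/label_map.pbtxt"
    }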
Depending on your training results, you may need to further adjust some of the fields in the config file, but after the training your model will focus only on traffic light class and will likely have a high accuracy.
All the specific tutorials can be found on the repo site. If you have any further questions, you can ask on Stack Overflow with the tag object-detection-api and a lot of people will help.

False positives in faster-rcnn object detection

I'm training an object detector using tensorflow and the faster_rcnn_inception_v2_coco model and am experiencing a lot of false positives when classifying on a video.
After some research I've figured out that I need to add negative images to the training process.
How do I add these to tfrecord files? I used the csv to tfrecord file code provided in the tutorial here.
Also, it seems that SSD has a hard_example_miner in the config that allows you to configure this behaviour, but this doesn't seem to be the case for Faster R-CNN. Is there a way to achieve something similar with Faster R-CNN?
I was facing the same issue with Faster R-CNN. Although you cannot actually use hard_example_miner with the Faster R-CNN model, you can add some background images, i.e. images with no objects (everything remains the same, except there is no object tag in the xml for that particular picture).
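In the TFRecord itself, a background image is just an example whose object lists are left empty. A rough sketch, assuming the usual dataset_util helpers from the Object Detection API that the tutorial's script already uses:

    # Sketch of a "background" example: image features as usual, object lists empty.
    import tensorflow as tf
    from object_detection.utils import dataset_util

    def negative_example(encoded_jpg, height, width, filename):
        feature = {
            'image/encoded': dataset_util.bytes_feature(encoded_jpg),
            'image/format': dataset_util.bytes_feature(b'jpg'),
            'image/filename': dataset_util.bytes_feature(filename.encode('utf8')),
            'image/source_id': dataset_util.bytes_feature(filename.encode('utf8')),
            'image/height': dataset_util.int64_feature(height),
            'image/width': dataset_util.int64_feature(width),
            'image/object/bbox/xmin': dataset_util.float_list_feature([]),
            'image/object/bbox/xmax': dataset_util.float_list_feature([]),
            'image/object/bbox/ymin': dataset_util.float_list_feature([]),
            'image/object/bbox/ymax': dataset_util.float_list_feature([]),
            'image/object/class/text': dataset_util.bytes_list_feature([]),
            'image/object/class/label': dataset_util.int64_list_feature([]),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))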
One more thing that worked wonders for me was using the imgaug library: you can augment the images and the bounding boxes using the same script. Try to increase the training data by 10 or 15 times, and then I would suggest training again for around 150,000-200,000 steps.
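A small sketch of that imgaug usage (the augmenters and box coordinates are only illustrative, and 'image' is a numpy array you load elsewhere):

    # Joint image + bounding-box augmentation with imgaug.
    import imgaug.augmenters as iaa
    from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

    seq = iaa.Sequential([
        iaa.Fliplr(0.5),               # horizontal flip half the time
        iaa.Affine(rotate=(-10, 10)),  # small rotations
        iaa.Multiply((0.8, 1.2)),      # brightness jitter
    ])

    bbs = BoundingBoxesOnImage([BoundingBox(x1=65, y1=100, x2=200, y2=150)], shape=image.shape)
    image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)  # boxes move with the image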
These two steps helped me reduce the number of false positives effectively.

tensorflow object api faster_rcnn_resnet101 training image resizing

I am currently using the Tensorflow Object API to train my own classes. I am retraining using the faster_rcnn_resnet101_coco model.
To create the training data, I used RectLabel to put bounding boxes around objects in approximately 100 images. Each image contains roughly 30 classes, out of a total of 40 classes present across all the images.
My images are 1920 × 1080 in size. The images are produced by pulling random frames from videos of the objects I would like to detect.
My issue is that I am not getting any detections (Tensorboard is not showing any) and I think it is because the training images are being resized and the objects in the images are getting too small. I am using the default faster_rcnn_resnet101_coco.config file with no changes (except for locations to the data).
Would it be a good idea to perform a random crop of the images (instead of resizing as below) so as to keep the object size the same for training?
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
Could there be another issue I am overlooking?
I used to deal with an object detection problem and got nothing at first. After training the model for two more days, I got the right results.
More training and more data may be helpful.
If you're worried that the resizing is making the objects too small to detect, you can use a larger input resolution. Theoretically you could do this only on your training data, but I'm not sure it would give good results with such a tiny training set.
Instead, you can first fine-tune the pre-trained model with the same dataset (COCO?) on the larger input resolution, and only then fine-tune it on your training data with the larger resolution.
This way, the model will theoretically first learn to adapt to the larger resolution, and then will learn your classes.
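For example (the values are only illustrative), the resizer limits could be raised toward the native 1920x1080 so that small objects stay larger after resizing:

    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 1080
        max_dimension: 1920
      }
    }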
I would also like to side with Friday2013 and suggest getting more training data, possibly with more augmentation, and then more training time. Training longer alone might not help if you still train on the same small number of images, since you would get overfitting.