Build a dataset for TensorFlow - tensorflow

I have a large number of JPGs representing vehicles. I want to create a dataset for TensorFlow with a categorization such that every vehicle image is labelled with the side, the angle, or the roof it shows, i.e. I want to create nine subsets of images (front, back, driver side, driver front angle, driver back angle, passenger side, passenger front angle, passenger back angle, roof). At the moment the filename of each JPG describes the desired viewpoint.
How can I turn this set into a dataset that TensorFlow can easily manipulate? Also, should I run a procedure that crops each JPG to extract only the vehicle portion? How could I do that using TensorFlow?
I apologize in advance for not providing details and examples with this question, but I don't really know how to find an entry point for this problem. The tutorials I'm following all assume an already created dataset that is ready to use.

Okay, I'm going to try to answer this as well as I can, but producing and pre-processing data for use in ML algorithms is laborious and often expensive (hence the repeated use of well-known datasets for testing algorithm designs).
To address a few straight-forward questions first:
Should I run a procedure that crops the JPG to extract only the vehicle portion?
No. This isn't necessary. The neural network will sort the relevant information in the images from the irrelevant itself, and having a diverse set of images will help to build a robust classifier. Also, cropping would likely make life a lot more difficult for yourself later on, since the crops would all end up with different sizes (see point 1 below for more on resizing).
How could I do that using TensorFlow?
You wouldn't. TensorFlow is designed for building and testing ML models, and does not have tools for pre-processing data. (Well, perhaps TensorFlow Extended does, but that shouldn't be necessary here.)
Now a rough guideline for how you would go about creating a data set from the files described:
1) The first thing you will need to do is to load your .jpg images into Python and resize them all to be identical. A neural network needs the same number of inputs (pixels in this case) in every training example, so having differently sized images will not work.
There is a good answer detailing how to load images using the Python Imaging Library (PIL) on Stack Overflow here.
The PIL image instances (elements of the list loadedImages in the example above) can then be converted to numpy arrays using data = np.asarray(image), which TensorFlow can work with.
In addition to building a set of numpy arrays of your data, you will also need a second numpy array of labels for this data. A typical way to encode this is as a numpy array the same length as your number of images, with an integer value for each entry representing the class to which that image belongs (0-8 for your 9 classes). You could input these by hand, but this would be labour intensive, so I would suggest using Python's built-in string find method to locate keywords within the filenames and automate determining their class. This could be done within the
for image in imagesList:
loop in the above link, as image should be a string containing the image filename.
As I mentioned above, resizing the images is necessary to make sure they are all identical. You could do this with numpy, using indexing to choose a subsection of each image array, or using PIL's resize function before converting to numpy. There is no right answer here, and many methods have been used to resize images for this purpose, from padding to stretching to cropping.
The end result here should be 2 numpy arrays. The first holds the image data and has shape [n, h, w, 3], where n = the number of images you have, h = image height, w = image width, and 3 = the three RGB channels (provided the images are in colour); this batch-first, channels-last layout is what TensorFlow expects. The second holds the labels associated with these images and has shape [n,], where every element of the length-n array is an integer from 0-8 specifying its class.
At this point it would be a good idea to save the dataset in this format using numpy.save() so that you don't have to go through this process again.
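To make the above concrete, here is a minimal sketch of step 1, assuming the JPGs live in a vehicles/ folder and that each filename contains one of the nine category keywords; the folder name, keyword matching, and target size are assumptions you would adapt to your own files:

```python
# Minimal sketch: load JPGs, resize, derive labels from filenames, save as numpy.
# Folder name, keywords, and target size are assumptions.
import glob
import numpy as np
from PIL import Image

CLASSES = ["front", "back", "driver side", "driver front angle",
           "driver back angle", "passenger side", "passenger front angle",
           "passenger back angle", "roof"]
TARGET_SIZE = (224, 224)   # (width, height) - every image resized to the same shape

images, labels = [], []
for path in glob.glob("vehicles/*.jpg"):            # hypothetical folder
    name = path.replace("_", " ").lower()
    # keep the longest keyword found, so "driver front angle" is not misread as "front"
    matches = [i for i, c in enumerate(CLASSES) if name.find(c) != -1]
    if not matches:
        continue                                     # skip files with no recognisable keyword
    label = max(matches, key=lambda i: len(CLASSES[i]))
    img = Image.open(path).convert("RGB").resize(TARGET_SIZE)
    images.append(np.asarray(img))
    labels.append(label)

data = np.stack(images)          # shape [n, h, w, 3]
labels = np.asarray(labels)      # shape [n], integers 0-8
np.save("vehicle_images.npy", data)
np.save("vehicle_labels.npy", labels)
```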
2) Once you have your images in this format, TensorFlow has a class called tf.data.Dataset into which you can load the image and label data described above; it will allow you to shuffle, batch, and sample data from it.
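Continuing the sketch above, something along these lines would wrap the saved arrays in a tf.data.Dataset (the batch size is an arbitrary choice):

```python
# Load the saved arrays and wrap them in a tf.data.Dataset for shuffling/batching.
import numpy as np
import tensorflow as tf

data = np.load("vehicle_images.npy")
labels = np.load("vehicle_labels.npy")

dataset = tf.data.Dataset.from_tensor_slices((data, labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(32)

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape, batch_labels.shape)   # e.g. (32, 224, 224, 3) (32,)
```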
I hope that was helpful, and I am sorry that there is no quick-fix solution to this (at least not one I am aware of). Good luck.

Related

Image Detection & Classification - general approach?

I'm trying to build a detection + classification model that will recognize an object in an image and classify it. Every image will contain at most 1 object among my 10 classes (i.e. the same image cannot contain 2 classes). An image can, however, contain none of my classes/objects. I'm struggling with the general approach to this problem, especially due to the nature of my problem; my objects have different sizes. This is what I have tried:
Trained a classifier with images that only contain my objects/classes, i.e. every image is the object itself with the background pre-removed. Now, since the objects/images have different shapes (aspect ratios) I had to reshape the images to the same size (destroying the aspect ratios). This would work just fine if my purpose was only to build a classifier, but since I also need to detect the objects, this didn't work so well.
The second approach was similar to (1), except that I didn't reshape the objects naively, but kept the aspect ratios by padding the image with 0 (black). This completely destroyed my classifier's ability to perform well (accuracy < 5%).
Mask RCNN - I followed this blog post to try to build a detector + classifier in the same model. The approach took forever and I wasn't sure it was the right one. I even used external tools (RectLabel) to generate annotated image files containing information about the bounding boxes.
Question:
How should I approach this problem, on a general level:
Should I build 2 separate models? (One for detection/localization and one for classification?)
Should I be annotating my images using annotations file as in approach (3)?
Do I have to reshape my images at any stage?
Thanks,
PS. In all of my approaches, I augmented the images to generate ~500-1000 images per class.
To answer your questions:
No, you don't have to build two separate models. What you are describing is called object detection, which is classification along with localization. There are many models that do this: Mask R-CNN, YOLO, Detectron, SSD, etc.
Yes, you do need to annotate your images for training a model on your custom classes. Each of the models mentioned above needs a different annotation format.
No, you don't need to do any image resizing. Most of the time it is done when the model loads the data for training or inference.
You are on the right track with trying MaskRCNN.
Other than MaskRCNN, you could also try Yolo. There is also an accompanying easy-to-use annotating tool Yolo-Mark.
If you go through this tutorial, you will understand what you need:
How to train your own Object Detector with TensorFlow’s Object Detector API
The SSD model is small, so training it would not take very long.
There are a number of object detection models you could use.
With RectLabel, you can save bounding boxes in the PASCAL VOC format.
You can then export a TFRecord file for TensorFlow.
https://rectlabel.com/help#tf_record
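For reference, the TFRecord files used by TensorFlow's Object Detection API essentially bundle each encoded image with its normalized bounding-box coordinates. A rough sketch of building one such tf.train.Example by hand; the feature keys follow the API's convention, while the file name and coordinate values are placeholders (tools like RectLabel generate these files for you):

```python
# Rough sketch of a tf.train.Example in the layout the TF Object Detection API
# expects. Values are placeholders.
import tensorflow as tf

def _bytes(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=v))
def _floats(v): return tf.train.Feature(float_list=tf.train.FloatList(value=v))
def _ints(v): return tf.train.Feature(int64_list=tf.train.Int64List(value=v))

with open("image_0001.jpg", "rb") as f:               # hypothetical image file
    encoded = f.read()

example = tf.train.Example(features=tf.train.Features(feature={
    "image/encoded": _bytes([encoded]),
    "image/format": _bytes([b"jpeg"]),
    "image/height": _ints([480]),
    "image/width": _ints([640]),
    "image/object/bbox/xmin": _floats([0.10]),         # coordinates normalized to [0, 1]
    "image/object/bbox/xmax": _floats([0.55]),
    "image/object/bbox/ymin": _floats([0.20]),
    "image/object/bbox/ymax": _floats([0.90]),
    "image/object/class/text": _bytes([b"my_class"]),
    "image/object/class/label": _ints([1]),
}))

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())
```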

Best practice for video ground truthing?

I would like to train a deep learning framework (TensorFlow) for object detection with a new object category.
As source for the ground truthing I have multiple video files which contain the object (only part of the image contains the object).
How should I ground truth the video? Should I extract the video frame by frame and label every frame, even when consecutive frames are quite similar? Or what would be best practice for such a task?
Open source tools are preferred.
It usually works as you described. At least for iteration zero:
collect required examples (video)
extract valuable frames from the video (a manual or partially automated process; see the sketch after this list)
use OpenCV (or any other tool) to extract required details (bounding box, accurate mask)
assemble a training set
train a model
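A minimal sketch of the frame-extraction step, assuming OpenCV (cv2) is installed and that simply keeping every Nth frame is good enough for a first pass; the video path and sampling rate are placeholders:

```python
# Extract every Nth frame from a video with OpenCV and save it as a PNG.
import os
import cv2

VIDEO_PATH = "input_video.mp4"   # hypothetical source video
EVERY_N = 30                     # roughly one frame per second for 30 fps footage

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)
index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break                    # end of video (or read error)
    if index % EVERY_N == 0:
        cv2.imwrite(f"frames/frame_{saved:05d}.png", frame)
        saved += 1
    index += 1
cap.release()
```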
Here is an example of a training set, produced by the approach described above (see it in action)
For iteration one you might use the iteration-zero models to significantly improve steps 2 and 3 and grow the training set even more.
I'm trying to solve pretty much the same problem, because it is hard to produce a training set to get accurate segmentation:
(again, here it is in action and other examples)
Basically, start with a semi-manual approach and try to evolve.

Is it possible to pack multiple images into a single TFRecord example?

I'm in a situation where the input to my ML model is a variable number of images per example (but only 1 label for each set), and so I would like to be able to pack multiple images into a single TFRecord example. However, every example I come across online is single image, single label, which is understandable because that's the most common use-case. I also wonder about decoding... it appears that tf.image.decode_png only does one image at a time, but perhaps I can convert all the images to tf.string and use tf.decode_raw, then resize to get all the images?
Thanks
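For what it's worth, one rough sketch of the packing described in the question is to store the encoded images as a repeated bytes feature and decode them with a map at parse time; this assumes all images within an example share the same height and width so they stack cleanly:

```python
# Sketch: a variable number of encoded PNGs per tf.train.Example, one label per set.
import tensorflow as tf

def make_example(encoded_pngs, label):
    # encoded_pngs: list of PNG-encoded byte strings, label: int
    feature = {
        "images": tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded_pngs)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, {
        "images": tf.io.VarLenFeature(tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    encoded = tf.sparse.to_dense(parsed["images"], default_value=b"")
    # decode each image; assumes all images in the example have the same size
    images = tf.map_fn(lambda s: tf.io.decode_png(s, channels=3),
                       encoded, fn_output_signature=tf.uint8)
    return images, parsed["label"]

# dataset = tf.data.TFRecordDataset("packed.tfrecord").map(parse_example)
```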

Different results in image classifier using JPG vs BMP

I trained an image classifier with TensorFlow using a bunch of JPG images.
Let's say I have 3 classifiers, ClassifierA, ClassifierB, ClassifierC.
When testing the classifiers, I have no issues at all in 90% of the images I use as a test. But in some cases, I have misclassifications due to the image quality.
For example, the image below is the same, saved as BMP and JPG. You'll see little differences due to the format quality.
When I test the BMP version using tf.image.decode_bmp I get misclassifications, let's say ClassifierA 70%
When I test the JPG version using tf.image.decode_jpeg I get the right one, ClassifierB 90%
When I test the JPG version using tf.image.decode_jpeg and dct_method="INTEGER_ACCURATE" I get the right one with a much better result, ClassifierB 99%
What could be the issue here? Such difference between BMP and JPG, and how can I solve this if there's a solution?
Update 1: I retrained my classifier using different effects and randomly changing the quality in which I save the images I use as a dataset.
Now, I get the right output, but the percentages still change a lot, for example 44% with BMP and +90% with JPG.
This is a fabulous question, and even more fabulous of an observation. I'm going to use this in my own work in the future!
I expect you have just identified a rather fascinating issue with the dataset. It appears that your model is overfitting to features specific to JPG compression. The solution is to increase data augmentation. In particular, convert your training samples between various formats randomly.
This issue also makes me think that sharpening and blurring operations would make good data augmentation features. It's common to alter color, contrast, rotation, scale, orientation, and translation of the image to augment the training dataset, but I don't commonly see blur and sharpness used. I suspect these two data augmentation techniques will go a long way to resolving your issue by themselves.
In case the OP (or others reading this) are not terribly familiar with what "data augmentation" is, I'll define it. It is common to warp your training images in various ways to generate endlessly unique images from your (otherwise finite) dataset. For example, randomly flipping the image left/right is quite simple, common, and effectively doubles your dataset. Changing contrast and brightness settings further alter your images. Adding these and other data augmentation transformations to your pipeline creates a much richer dataset and trains a network that is more robust to these common variations in images.
It's important that the data augmentation techniques you use produce realistic variations. For example, rotating an image is quite a realistic augmentation technique. If your training image is a cat standing horizontally, it's realistically possible that a future sample might be a cat at a 25-degree angle.
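As a rough sketch of what such an augmentation step could look like in a tf.data pipeline (the quality range and other parameters are assumptions, and the image is assumed to be a float32 tensor in [0, 1]):

```python
# Sketch: random JPEG re-encoding quality plus a few standard augmentations,
# so the classifier stops keying on compression artefacts. Ranges are assumptions.
import tensorflow as tf

def augment(image, label):
    # image: float32 tensor in [0, 1], shape [h, w, 3]
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # simulate saving the image at different compression qualities
    image = tf.image.random_jpeg_quality(image, min_jpeg_quality=40,
                                         max_jpeg_quality=100)
    return image, label

# train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```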

Counting Pedestrians Using TensorFlow's Object Detection

I am new to the machine learning field, and based on what I have seen on YouTube and read on the internet I conjectured that it might be possible to count pedestrians in a video using TensorFlow's object detection API.
Consequently, I did some research on TensorFlow, read the documentation about how to install it, and then finally downloaded and installed it. Using the sample files provided on GitHub, I adapted the code from the object_detection notebook provided here: https://github.com/tensorflow/models/tree/master/research/object_detection.
I executed the adapted code on the videos that I collected, while making changes to the visualization_utils.py script so as to report the number of objects that cross a defined region of interest on the screen. That is, I collected the bounding box dimensions (left, right, top, bottom) of the person class and counted all the detections that crossed the defined region of interest (imagine a set of two virtual vertical lines on the video frame with left and right pixel values, and then comparing the detected bounding box's left and right values with the predefined values). However, when I use this procedure I am missing a lot of pedestrians even though they are detected by the program. That is, the program correctly classifies them as persons, but sometimes they don't meet the criteria that I defined for counting and as such they are not counted. I want to know if there is a better way of counting unique pedestrians using the code, rather than the simplistic method I am trying to develop. Is the approach that I am using the right one? Could there be other, better approaches? Would appreciate any kind of help.
Please go easy on me as I am not a machine learning expert and just a novice.
You are using a pretrained model which is trained to identify people in general. I think you're saying that some people are pedestrians whereas some other people are not pedestrians, for example, someone standing waiting at the light is a pedestrian, but someone standing in their garden behind the street is not a pedestrian.
If I'm right, then you've reached the limitations of what you'll get with this model and you will probably have to train a model yourself to do what you want.
Since you're new to ML, building your own dataset and training your own model probably sounds like a tall order; there's a learning curve, to be sure. So I'll suggest the easiest way forward. That is, use the object detection model to identify people, then train a new binary classification model (about the easiest model to train) to identify whether a particular person is a pedestrian or not (you will create a dataset of images and 1/0 values to identify them as pedestrian or not). I suggest this because a boolean classification model is about as easy a model as you can get and there are dozens of tutorials you can follow (a minimal sketch also follows after the notes below). Here's a good one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network.ipynb
A few things to note when doing this:
When you build your dataset you will want a set of images, at least a few thousand, along with the 1/0 classification for each (pedestrian or not pedestrian).
You will get much better results if you start with a model that is pretrained on ImageNet than if you train it from scratch (though training from scratch might be a reasonable step 2, as it's an extra task), especially if you only have a few thousand images to train it on.
Since your images will have multiple people in it you have a problem of identifying which person you want the model to classify as a pedestrian or not. There's no one right way to do this necessarily. If you have a yellow box surrounding the person the network may be successful in learning this notation. Another valid approach might be to remove the other people that were detected in the image by deleting them and leaving that area black. Centering on the target person may also be a reasonable approach.
My last bullet point illustrates a problem with the idea as I've proposed it. The best solution would be to alter the object detection network to output both a bounding box per person and a pedestrian/non-pedestrian classification with it; or to train the model to identify only pedestrians, specifically, in the first place. I mention this as more optimal, but I consider it a more advanced task than my first suggestion, and a more complex dataset to manage. It's probably not the first thing you want to tackle as you learn your way around ML.
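Putting those notes together, here is a minimal sketch of the suggested second-stage binary classifier, assuming you have already cropped each detected person into its own image and labelled the crops 1 (pedestrian) / 0 (not pedestrian); the backbone choice, input size, and dataset names are placeholders:

```python
# Sketch: binary pedestrian / not-pedestrian classifier on cropped person images,
# starting from an ImageNet-pretrained backbone as suggested in the notes above.
# Inputs are assumed to be 224x224 crops already scaled to [-1, 1]
# (e.g. via tf.keras.applications.mobilenet_v2.preprocess_input).
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False   # first train only the new classification head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(pedestrian)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds / val_ds: tf.data.Datasets of (cropped person image, 0/1 label) pairs
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```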