Clarifai - Returning regions for custom trained models

The documentation shows that only concepts are returned for custom trained models:
{
  "status": {
    "code": 10000,
    "description": "Ok"
  },
  "outputs": [
    {
      ...,
      "created_at": "2016-11-22T16:59:23Z",
      "model": {
        ...
      },
      "model_version": {
        ...
      },
      "input": {
        "id": "e1cf385843b94c6791bbd9f2654db5c0",
        "data": {
          "image": {
            "url": "https://s3.amazonaws.com/clarifai-api/img/prod/b749af061d564b829fb816215f6dc832/e11c81745d6d42a78ef712236023df1c.jpeg"
          }
        }
      },
      "data": {
        "concepts": [
          {
            ...
          },
          ...
        ]
      }
    }
  ]
}
Pre-trained models such as Demographics and Face, on the other hand, return regions with the x/y location in the image.
If I want to detect WHERE in the image a concept is predicted for my custom models, is my only option to split the image into a grid and submit each tile as bytes? This seems counter-productive, as it would incur additional lookups.

In the Clarifai platform, Demographics, Face Detection and Apparel Detection are all object detection models. General, travel, food, etc. are classification models. Classification and object detection are two different (although similar-seeming) computer vision tasks.
For example, if you're looking to classify an image as 'sad', it doesn't make sense to have a bounding box (i.e. an area outlining the 'sadness'). Classification takes into account the entire image.
Object detection looks at individual parts of the image and tries to see whether the object is there (kind of like you were suggesting with your workaround): where is the 'knife', or whatever discrete object you're looking for?
Confusingly, you could have conceptual overlap, such as having a concept of 'face'. A picture could have this classification, but there could also be a specific 'face' object that is detected at a specific place. Classifications are not limited to abstract concepts (although it is helpful to think of them that way when thinking about the differences between these two approaches).
Right now all custom models are classification models and not object detection models. I think there is work being done for this on the enterprise level of the system but I don't believe there is anything currently available. The general models you are using sound like they happen to be object detection models - so you get some bonus information with them!
BTW: If I understand it, your proposed workaround should work, by basically splitting the image up into small images and asking for classification on each of them. You're right that it would be inefficient, but I'm not sure of a better option at the moment.
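To make that workaround concrete, here is a rough sketch of tiling an image and recording which tiles a concept fires on. It assumes the v2 REST predict endpoint (POST https://api.clarifai.com/v2/models/{model_id}/outputs) with base64 image inputs; the model ID, API key, grid size and score threshold are all placeholders you would replace with your own values.

# Sketch of the grid workaround: tile the image, send each tile to the
# custom model's predict endpoint, and keep the tile coordinates so a
# predicted concept can be mapped back to a rough region.
# MODEL_ID and API_KEY are placeholders.
import base64
import io

import requests
from PIL import Image

API_KEY = "YOUR_API_KEY"           # placeholder
MODEL_ID = "YOUR_CUSTOM_MODEL_ID"  # placeholder
PREDICT_URL = "https://api.clarifai.com/v2/models/{}/outputs".format(MODEL_ID)


def predict_tiles(image_path, rows=3, cols=3, min_value=0.8):
    """Split the image into a rows x cols grid and predict on each tile."""
    image = Image.open(image_path)
    width, height = image.size
    tile_w, tile_h = width // cols, height // rows
    hits = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tile = image.crop(box)
            buf = io.BytesIO()
            tile.save(buf, format="JPEG")
            payload = {"inputs": [{"data": {"image": {
                "base64": base64.b64encode(buf.getvalue()).decode()}}}]}
            resp = requests.post(PREDICT_URL, json=payload,
                                 headers={"Authorization": "Key " + API_KEY})
            resp.raise_for_status()
            concepts = resp.json()["outputs"][0]["data"].get("concepts", [])
            for concept in concepts:
                if concept["value"] >= min_value:
                    hits.append({"box": box,
                                 "concept": concept["name"],
                                 "value": concept["value"]})
    return hits

Note that every tile is a separate predict call, so a 3x3 grid costs nine calls per image, which is exactly the overhead the question is worried about.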

Related

Model training - cropped image of the object VS bigger image with bounding box

I need to train a new model (Keras + TensorFlow) and I was asking myself if there is any difference between
Providing a bunch of images containing only the object of interest (cropped from the original image)
Providing bigger images with object annotations (coordinates of the bounding box and the class)
My logic tells me that most probably, internally, the training should be done only on the cropped part, so technically there shouldn't be a difference.
Regards
The two approaches you are describing are commonly referred to as image classification (where a model only needs to classify the image) and object detection (where a model needs to detect the location of an object in an image and classify it). They are sometimes simply differentiated as "classification" and "detection". These two approaches require different techniques, and different models have been developed to handle each one. In general, image classification is an easier problem, as you may have intuited.
Which approach to use depends on your end application. If you only need to know, "does an object exist in this image" then you can use classification techniques. If you need to know "where in this image is the object" or "how many of these objects are in the image", then you should use detection techniques.
What may be non-intuitive is that object detection is not simply an extension of image classification, so if you need object detection it is best to start with object detection models instead of building an image classifier that you then extend to object detection. This article provides some intuition on this topic.

Multi-label image classification vs. object detection

For my next TF2-based computer vision project I need to classify images into a pre-defined set of classes. However, multiple objects of different classes can occur in one such image. That sounds like an object detection task, so I guess I could go for that.
But: I don't need to know where on an image each of these objects are, I just need to know which classes of objects are visible on an image.
Now I am thinking about which route I should take. I am particularly interested in a high-accuracy/high-quality solution, so I would prefer the approach that leads to better results. From your experience, should I still go for an object detector even though I don't need to know the location of the detected objects in the image, or should I rather build an image classifier that outputs all the classes present in an image? Is this even an option; can a "normal" classifier output multiple classes?
Since you don't need the object localization, stick to classification only.
Although you will be tempted to use a standard off-the-shelf multi-class, multi-label object detection network because of its re-usability, realize that you are asking the model to do more things. If you have tons of data, that is not a problem. Or, if your objects are similar to the ones used in ImageNet/COCO etc., you can simply use a standard off-the-shelf object detection architecture and fine-tune it on your dataset.
However, if you have less data and you need to train from scratch (e.g. medical images, unusual objects), then object detection will be overkill and will give you inferior results.
Remember, most object detection networks recycle classification architectures, with modifications added to the last layers to incorporate additional outputs for the object detection coordinates. There is a loss function associated with those additional outputs. During training, in order to get the best loss value, some classification accuracy is compromised for the sake of better object localization coordinates. You don't need that compromise, so you can modify the last layer of an object detection network and remove the outputs for coordinates.
Again, all this hassle is only worth it if you have little data and you really need to train from scratch.
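To answer the "can a normal classifier output multiple classes?" part directly: yes, by using one sigmoid output per class with a binary cross-entropy loss instead of a softmax. Here is a minimal TF2/Keras sketch; the backbone, input size and number of classes are illustrative choices, not requirements.

# Minimal multi-label classifier sketch: one independent sigmoid per class
# lets a single image activate several labels at once (unlike softmax,
# which forces a single winner).
import tensorflow as tf

NUM_CLASSES = 5  # illustrative

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg",
    weights="imagenet")
base.trainable = False  # unfreeze later to fine-tune if needed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),  # not softmax
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # per-label binary loss
    metrics=[tf.keras.metrics.AUC(multi_label=True)],
)

# Labels are multi-hot vectors, e.g. an image containing classes 0 and 3:
#   y = [1, 0, 0, 1, 0]
# At inference time, threshold each sigmoid output (e.g. > 0.5) independently.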

Training from remote resources

All,
I've researched this some and haven't found a clear answer anywhere.
Using Keras with a TF backend, how can you train a model using assets (images, for example) that are not local but remote?
For example, if you have 1M images on s3 that are labeled but not organized by folder, is there a practical way to stream in data in a way Keras can use to train a model?
My thinking is that I would supply a file that was of the format:
{ label: "Apple", img: http://someurl/img.jpg }
{ label: "Banana", img: http://someurl/img.jpg }
{ label: "Orange", img: http://someurl/img.jpg }
You could use preprocessing.load_img or Pillow to fetch and resize the image from the URL.
This question is more about the correct process for this, and whether it is feasible.
This would be possible by mirroring Keras' generator API. You can make a standard python generator that has an index of image URLs, and yields batches of images loaded from those URLs.
However, I would not recommend this approach. Loading images from the web introduces extra latency, which has the potential to significantly slow down your model's training. The only case where it might be a good idea is if you literally do not have the space on your SSD to store the whole dataset, and/or you find that the time it takes to load a batch of images is short compared to the time it takes to train on that batch.
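A rough sketch of that generator approach follows, assuming the manifest is stored as one JSON record per line in the shape shown in the question ({"label": ..., "img": ...}); the class list and file name are placeholders.

# Generator that reads a manifest of {label, img} records, downloads each
# image on the fly, and yields (batch_x, batch_y) pairs that Keras can
# consume via fit().
import io
import json

import numpy as np
import requests
from PIL import Image

CLASSES = ["Apple", "Banana", "Orange"]  # placeholder label set


def url_batch_generator(manifest_path, batch_size=32, target_size=(224, 224)):
    with open(manifest_path) as f:
        records = [json.loads(line) for line in f]
    while True:  # Keras expects generators to loop indefinitely
        np.random.shuffle(records)
        for start in range(0, len(records) - batch_size + 1, batch_size):
            images, labels = [], []
            for rec in records[start:start + batch_size]:
                resp = requests.get(rec["img"], timeout=10)
                img = Image.open(io.BytesIO(resp.content)).convert("RGB")
                img = img.resize(target_size)
                images.append(np.asarray(img, dtype="float32") / 255.0)
                # Integer class index; pair with sparse_categorical_crossentropy.
                labels.append(CLASSES.index(rec["label"]))
            yield np.stack(images), np.array(labels)

# Usage (model built elsewhere):
# model.fit(url_batch_generator("manifest.jsonl"),
#           steps_per_epoch=num_records // 32, epochs=5)

Prefetching batches in a background thread, or caching downloaded images to local disk on first use, would mitigate some of the latency concern mentioned above.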

how to use tensorflow object detection API for face detection

OpenCV provides a simple API to detect and extract faces from given images. (I do not think it works perfectly, though, because in my experience it cuts out frames from the input pictures that have nothing to do with faces.)
I wonder if the TensorFlow API can be used for face detection. I failed to find relevant information, but I am hoping that someone experienced in the field can guide me on this subject. Can TensorFlow's object detection API be used for face detection in the same way OpenCV does? (I mean, you just call the API function and it gives you the face image from the given input image.)
You can, but some work is needed.
First, take a look at the object detection README. There are some useful articles you should follow. Specifically: (1) Configuring an object detection pipeline, (2) Preparing inputs and (3) Running locally. You should start with an existing architecture and a pre-trained model. Pretrained models can be found in the Model Zoo, and their corresponding configuration files can be found here.
The most common pre-trained models in the Model Zoo are trained on the COCO dataset. Unfortunately, this dataset doesn't contain face as a class (but it does contain person).
Instead, you can start with a pre-trained model on Open Images, such as faster_rcnn_inception_resnet_v2_atrous_oid, which does contain face as a class.
Note that this model is larger and slower than common architectures used on the COCO dataset, such as SSDLite over MobileNetV1/V2. This is because Open Images has many more classes than COCO, and therefore a well-performing model needs to be much more expressive in order to distinguish between the large number of classes and localize them correctly.
Since you only want face detection, you can try the following two options:
If you're okay with a slower model that will probably give better performance, start with faster_rcnn_inception_resnet_v2_atrous_oid; you will only need to fine-tune it slightly on the single class of face.
If you want a faster model, you should probably start with something like SSDLite-MobileNetV2 pre-trained on COCO, but then fine-tune it on the class of face from a different dataset, such as your own or the face subset of Open Images.
Note that the fact that the pre-trained model isn't trained on faces doesn't mean you can't fine-tune it to be; it just might take more fine-tuning than a model which was pre-trained on faces as well.
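For either option, the changes live in the pipeline config. A rough sketch of the fields you would typically touch when fine-tuning on a single 'face' class is below, in the same format as the config files referenced above; every path here is a placeholder and the remaining fields come from the architecture's stock config.

model {
  faster_rcnn {
    num_classes: 1   # only the 'face' class
    ...
  }
}
train_config {
  fine_tune_checkpoint: "PATH_TO_PRETRAINED_CHECKPOINT/model.ckpt"  # placeholder
  from_detection_checkpoint: true
  ...
}
train_input_reader {
  label_map_path: "PATH_TO/face_label_map.pbtxt"  # placeholder
  tf_record_input_reader {
    input_path: "PATH_TO/face_train.record"       # placeholder
  }
}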
Just increase the shape of the input; I tried it and it works much better.

Tensorflow high false-positive rate and non-max-suppression issue

I am training TensorFlow Object Detection on Windows 10, using faster_rcnn_inception_v2_coco as the pretrained model, with tensorflow-gpu 1.6 on an NVIDIA GeForce GTX 1080, CUDA 9.0 and cuDNN 7.0.
My dataset contains only one object, "Pistol", and 3000 images (2700 in the train set, 300 in the test set). The sizes of the images range from ~100x200 to ~800x600.
I trained this model for 55k iterations, where the mAP was ~0.8 and the TotalLoss seems to have converged to 0.001. However, looking at the evaluation, there are a lot of multiple bounding boxes on the same detected object (e.g. this and this), and a lot of false positives (a house detected as a pistol). For example, in this photo taken by me (a blur filter was applied later), the model detects a person and a car as pistols, as well as the correct detection.
The dataset is uploaded here, together with the tfrecords and the label map.
I used this config file, where the only things that I changed are: num_classes to 1, the fine_tune_checkpoint, input_path and label_map_path for train and eval, and num_examples.
Since I thought that the multiple boxes are a non-max-suppression problem, I changed the score_threshold (line 73) from 0 to 0.01 and the iou_threshold (line 74) from 1 to 0.6. With the standard values the outcome was much worse than this.
What can I do to have a good detection? What should I change? Maybe I miss something about parameters tuning...
Thanks
I think that before diving into parameter tuning (i.e. the mentioned score_threshold) you will have to review your dataset.
I didn't check the entire dataset you shared, but from a high-level view the main problem I found is that most of the images are really small and have highly variable aspect ratios.
In my opinion this conflicts with this part of your configuration file:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
If you take one of the images from your dataset and manually apply that transformation, you will see that the result is very noisy for small images and very deformed for the many images that have a different aspect ratio.
I would highly recommend that you re-build your dataset with higher-resolution images, and maybe try to preprocess the images with unusual aspect ratios using padding, cropping or other strategies.
If you want to stick with the small images you would have to at least change the min and max dimensions of the image_resizer, but, from my experience, the biggest problem here is the dataset and I would invest the time in trying to fix that.
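For the padding suggestion, here is a rough letterbox sketch using Pillow; the target canvas size and fill colour are illustrative, and any bounding-box annotations would need the same scale and offset applied.

# Letterbox each image onto a fixed-aspect-ratio canvas so the
# keep_aspect_ratio_resizer does not deform it.
from PIL import Image


def letterbox(image_path, out_path, target=(1024, 600), fill=(128, 128, 128)):
    img = Image.open(image_path).convert("RGB")
    # Scale so the image fits inside the target canvas without distortion.
    scale = min(target[0] / img.width, target[1] / img.height)
    new_size = (int(img.width * scale), int(img.height * scale))
    resized = img.resize(new_size, Image.BILINEAR)
    canvas = Image.new("RGB", target, fill)
    # Paste centred; shift bounding-box annotations by the same offset.
    offset = ((target[0] - new_size[0]) // 2, (target[1] - new_size[1]) // 2)
    canvas.paste(resized, offset)
    canvas.save(out_path)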
P.S.
I don't see the house false positive as a big problem if we consider that it comes from a totally different domain than your dataset.
You could probably adjust the minimum confidence required to consider a detection a true positive and remove it.
If you take the current winner of COCO and feed it strange images, like frames from a cartoon, you will see that it generates a lot of false positives.
So it's more of a problem with current object detection approaches, which are not robust to domain changes.
A lot of people I see online have been running into the same issue using the TensorFlow API. I think there are some inherent problems with the idea/process of using the pretrained models with custom classifier(s) at home. For example, people want to use SSD Mobile or Faster RCNN Inception to detect objects like "person with helmet", "pistol", or "tool box", etc. The general process is to feed in images of that object, but most of the time, no matter how many images (200 to 2000), you still end up with false positives when you actually go run it at your desk.
The object classifier works great when you show it the object in its own context, but you end up getting a 99% match on everyday items like your bedroom window, your desk, your computer monitor, keyboard, etc. People have mentioned the strategy of introducing negative or soft images. I think the problem has to do with the limited context in the images that most people use.
The pretrained models were trained with over a dozen classes in a wide variety of environments. One example could be a car on the street: the CNN sees the car, and everything else in that image that is not a car becomes a negative example, which includes the street, buildings, sky, etc. In another image it can see a bottle, and everything else in that image (desks, tables, windows, etc.) is negative. I think the problem with training custom classifiers is a negative-image problem: even if you have enough images of the object itself, there isn't enough data of that same object in different contexts and backgrounds. So in a sense there are not enough negative images, even if conceptually you shouldn't need them. When you run the algorithm at home you get false positives all over the place, identifying objects around your own room.
I think the idea of transfer learning in this way is flawed. We just end up seeing a lot of great tutorials online of people identifying playing cards, Millennium Falcons, etc., but none of those models are deployable in the real world, as they would all generate a bunch of false positives as soon as they see anything outside of their image pool. The best strategy would be to retrain the CNN from scratch with multiple classes and add the desired ones in there as well. I suggest re-introducing a previous dataset from ImageNet or Pascal with 10-20 pre-existing classes, adding your own, and retraining.