CNN vs SVM for smile intensity detection training? - tensorflow

I have a dataset made up of images of faces, with the corresponding landmarks that make up the mouth.
These landmarks are sets of 2D points (x,y pixel position).
Each image-landmark set data pair is tagged as either a smile, or neutral.
What I would like to do is train a deep learning model to return a smile intensity for a new image-landmark data pair.
What should I be searching for to help me with the next step?
Is it a CNN that I need? In my limited understanding, the usual training input is just an image, whereas I would also be passing in the landmark sets to train with. Or would an SVM approach be more accurate?
I am looking for maximum accuracy, as far as that is possible.
What is the approach that I need called?
I am happy to use PyTorch, Dlib or any framework, I am just a little stuck on the search terms to help me move forward.
Thank you.

It's hard to tell without looking into the dataset and experimenting. But hopefully, the following research materials will guide you in the right direction.
Machine learning-based approach:
https://www.researchgate.net/publication/266672947_Estimating_smile_intensity_A_better_way
Deep learning (CNN): https://arxiv.org/pdf/1602.00172.pdf
A list of awesome papers for smile and smile intensity detection:
https://github.com/EvelynFan/AWESOME-FER/blob/master/README.md
SmileNet project: https://sites.google.com/view/sensingfeeling/
Now, I'm assuming you don't have any label for actual smile intensity.
In such a scenario, the existing smile detection methods can be used directly: take the last activation output (a sigmoid) as a confidence score for smiling. The higher the confidence, the higher the intensity.
You can also use the facial landmark points as separate features (pass them through an LSTM block) and concatenate them with the CNN features at an early or later stage to improve the performance of your model.
If you do have labels for smile intensity, you can solve it as a regression problem: the CNN will have one output and will try to regress the smile intensity (the normalized smile intensity, with a sigmoid in this case).
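A minimal sketch of such a two-branch model in Keras (the input sizes, layer widths, and the use of a simple dense block for the landmarks instead of an LSTM are illustrative assumptions, not taken from the papers above):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Image branch: a small CNN over the face crop (sizes are illustrative).
img_in = layers.Input(shape=(96, 96, 1), name="face_image")
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Landmark branch: assume 20 mouth points flattened to 40 (x, y) values.
lmk_in = layers.Input(shape=(40,), name="mouth_landmarks")
y = layers.Dense(64, activation="relu")(lmk_in)

# Fuse both feature vectors, then a single sigmoid output.
z = layers.Concatenate()([x, y])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid", name="smile_intensity")(z)

model = Model([img_in, lmk_in], out)
# With binary smile/neutral labels, train with binary cross-entropy and
# read the sigmoid output as the intensity score at inference time; with
# real intensity labels, keep the same head and train with MSE instead.
model.compile(optimizer="adam", loss="binary_crossentropy")
```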


Change the spatial input dimension during training

I am training a YOLOv4 model (fully convolutional) in TensorFlow 2.3.0.
I would like to change the spatial input shape of the network during training, to further adjust the weights to different scales.
Is this possible?
EDIT:
I know of the existence of darknet, but it does not support some very specific augmentations I use and have implemented in my repo; that is why I am asking explicitly about TensorFlow.
To be more precise about what I want to do:
I want to train for several batches at Y1xX1xC, then change the input size to Y2xX2xC and train for several more batches, and so on.
It is not possible. In the past, people trained several networks for different scales, but the current state-of-the-art approach is feature pyramids:
https://arxiv.org/pdf/1612.03144.pdf
Another great candidate is dilated convolution, which can learn long-distance dependencies among pixels at varying distances. You can concatenate the outputs of several dilation rates, and the model will then learn which distance is important for which case:
https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5
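For illustration, here is a minimal sketch of that concatenation idea in Keras (the filter counts and dilation rates are arbitrary choices):

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_dilation_block(x, filters=32):
    # Parallel 3x3 convolutions with increasing dilation rates: each
    # branch covers a different receptive-field size on the same input.
    branches = [
        layers.Conv2D(filters, 3, padding="same", dilation_rate=d,
                      activation="relu")(x)
        for d in (1, 2, 4)
    ]
    # Concatenating lets later layers weight whichever distance matters.
    return layers.Concatenate()(branches)

inputs = layers.Input(shape=(None, None, 3))
outputs = multi_dilation_block(inputs)
model = tf.keras.Model(inputs, outputs)
```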
It would help to mention which TensorFlow repository you're using, but you can definitely achieve this. The idea is to keep the spatial input dimension fixed only within a single batch.
But an even better approach is to use the darknet repository from AlexeyAB: https://github.com/AlexeyAB/darknet
Just set random=1 in https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4.cfg [line 1149]. It will then train your network with randomly varying spatial dimensions.
One thing you can do is start your training in the AlexeyAB repo with random=1 set, then take the trained weights file into TensorFlow for fine-tuning.
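If you want to stay in TensorFlow, a fully convolutional Keras model can also be built with unspecified spatial dimensions, so the resolution can change between batches as long as it is uniform within each batch. A minimal sketch (the layer sizes, resolution list, and dummy targets are illustrative; a real YOLOv4 would keep its own head and loss):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Spatial dims are left as None; only the channel count is fixed, so the
# same weights accept any input resolution compatible with the strides.
inputs = layers.Input(shape=(None, None, 3))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2D(8, 1)(x)  # stand-in for a detection head
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Alternate the resolution between groups of batches; every image
# within a single batch must share the same shape.
for size in (320, 416, 608):
    images = np.random.rand(4, size, size, 3).astype("float32")
    targets = np.random.rand(4, size // 2, size // 2, 8).astype("float32")
    model.train_on_batch(images, targets)
```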

How to know what Tensorflow actually "see"?

I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own TensorFlow model actually "sees".
Google published a story showing the cat face that emerged inside an AI "brain":
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to extract such images from my own CNN?
For example, how does my own CNN model recognize a car?
We have to distinguish between what TensorFlow actually sees:

As we go deeper into the network, the feature maps look less like the original image and more like an abstract representation of it. As you can see in block3_conv1, the cat is somewhat visible, but after that it becomes unrecognizable. The reason is that deeper feature maps encode high-level concepts like "cat nose" or "dog ear" while lower-level feature maps detect simple edges and shapes. That's why deeper feature maps contain less information about the image and more about the class of the image. They still encode useful features, but they are less visually interpretable to us.
and what we can reconstruct from it as the result of some kind of reverse "deconvolution" process (which is not a real mathematical deconvolution).
To answer your real question: there are many good example solutions out there; one you can study with success is Visualizing output of convolutional layer in tensorflow.
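As a starting point, here is a minimal sketch of extracting and plotting one layer's feature maps from a Keras model; the pretrained VGG16 and the layer name block3_conv1 are just convenient examples, so substitute your own model and layer (use model.summary() to list the names):

```python
import tensorflow as tf
import matplotlib.pyplot as plt

model = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# A sub-model that exposes the chosen intermediate layer's output.
layer_name = "block3_conv1"
feature_extractor = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer(layer_name).output)

# Run one image through it and plot the first 16 feature maps.
image = tf.random.uniform((1, 224, 224, 3))  # replace with a real image
maps = feature_extractor(image)[0]           # shape: (H, W, channels)
for i in range(16):
    plt.subplot(4, 4, i + 1)
    plt.imshow(maps[..., i], cmap="viridis")
    plt.axis("off")
plt.show()
```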
When you build a model to perform visual recognition, you give it similar kinds of labelled data (pictures, in this case) to recognize, so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large training set of labelled pictures. This type of recognition is basically categorical recognition.
You can experiment with the MNIST dataset, which provides pictures of digits for image recognition.
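For instance, a minimal Keras CNN on MNIST could look like this (the architecture is an arbitrary small example, not official tutorial code):

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((28, 28, 1), input_shape=(28, 28)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3,
          validation_data=(x_test, y_test))
```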

Do I need every class in a training image for object detection?

I am just starting to dive into TensorFlow's Object Detection. I have a very small training set of about 40 images so far. Each image can contain up to 3 classes. But now a question came to mind: does every training image need to contain every class? Is that important for efficient training, or is it okay if an image contains only one of the object classes?
I get a very high total loss of ~8.0 and thought this might be the reason, but I couldn't find an answer.
In general, machine learning systems can cope with some amount of noise.
An image with missing or wrong labels is fine as long as you have sufficient data overall for the model to figure it out.
40 examples sounds very small for this task. It might work if you start from a pre-trained image network and there are few classes that are very easy to distinguish.
Ignore the absolute loss value; on its own it doesn't mean anything. Look at the curve to see that the loss is decreasing, and stop the training when the curve flattens out. Compare the loss value against a test dataset to check that the values are sufficiently similar (i.e., you are not overfitting). You might also compare it against another training run of the exact same system (to check that training is stable, for example).
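If your training loop records the losses (Keras, for example, returns them in the history object from fit), plotting both curves side by side makes this comparison easy; a minimal sketch:

```python
import matplotlib.pyplot as plt

# `history` is the object returned by model.fit(...) when validation
# data is supplied, e.g. history = model.fit(x, y, validation_split=0.2)
def plot_loss(history):
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```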

Predicting new values in logistic regression

I am building a logistic regression model in tensorflow to approximate a function.
When I randomly select training and testing data from the complete dataset, I get a good result, like so (blue are training points; red are testing points; the black line is the predicted curve):
But when I select spatially separate testing data, I get a terrible predicted curve, like so:
I understand why this is happening. But shouldn't a machine learning model learn these patterns and predict new values?
A similar thing happens with a periodic function, too:
Am I missing something trivial here?
P.S. I googled this query for quite some time but was not able to find a good answer.
Thanks in advance.
What you are trying to do here is not logistic regression. Logistic regression is a classifier, and what you are doing is regression.
No, machine learning systems aren't smart enough to learn to extrapolate functions like the ones you have here. When you fit a model, you are telling it to find an explanation for the training data; it doesn't care what the model does outside the range of that data. If you want it to be able to extrapolate, then you need to give it extra information: you could set it up to assume that the input belongs to a sine wave or a quadratic polynomial and have it find the best-fitting one. With no assumptions about the form of the function, however, you won't be able to extrapolate.
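For example, if you assume the data comes from a sine wave, you can fit that parametric form directly, and the fit will extrapolate precisely because the form is the extra information you supplied. A minimal sketch with SciPy (the data here is synthetic):

```python
import numpy as np
from scipy.optimize import curve_fit

def sine(x, amplitude, frequency, phase, offset):
    return amplitude * np.sin(frequency * x + phase) + offset

# Training data covers only x in [0, 5]; the fitted curve remains valid
# far outside that range *because* we assumed the sine form.
x_train = np.linspace(0, 5, 100)
y_train = 2.0 * np.sin(1.5 * x_train + 0.3) + 0.1 * np.random.randn(100)

params, _ = curve_fit(sine, x_train, y_train, p0=[1.0, 1.0, 0.0, 0.0])
x_new = np.linspace(5, 15, 100)
y_extrapolated = sine(x_new, *params)
```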

deep learning for shape localization and recognition

There is a set of images, each of which contains different shape entities, such as those shown in the following figure. I am trying to localize and recognize these different shapes: for instance, adding a bounding box around each shape and maybe even labelling it. What are the major research papers/deep learning models that have been able to solve this kind of problem?
Object detection papers such as R-CNN, Faster R-CNN, YOLO and SSD would help you solve this if you are set on using a deep learning approach.
It's easy to say this is a trivial problem that can be solved with tools in OpenCV and that deep learning is overkill, but I can see many reasons to use deep learning tools, and that objection does not answer your question.
We assume that your shapes have different scales and rotations. The main image shown above is actually very large for the training process and would need a lot of training samples to achieve good accuracy on test samples. In this case it is better to train a Convolutional Neural Network on small images (like 128x128) with only one shape per image, and then use the slide trick!
This project has three main steps:
1. Generate test and train samples; each image should contain only one shape.
2. Train a classifier to recognize the single shape within each input image.
3. Use the slide trick: break your original image containing many shapes into overlapping blocks of size 128x128 and pass each block to the model trained in the second step (a sketch of this step follows below).
This way, at the end, you will have a label for each shape from your trained model, and you will also have the location of each shape from the slide trick.
For the classifier you can use exactly the CNN structure from TensorFlow's MNIST tutorial.
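A minimal sketch of the slide-trick step; the block size, stride, confidence threshold, and the assumption that `model` is a trained Keras classifier are all illustrative:

```python
import numpy as np

def sliding_window_detect(image, model, block=128, stride=64, threshold=0.9):
    """Classify every block x block crop of `image` with `model`.

    `model` is assumed to take a (1, block, block, channels) batch and
    return per-class probabilities.
    """
    detections = []
    h, w = image.shape[:2]
    for top in range(0, h - block + 1, stride):
        for left in range(0, w - block + 1, stride):
            crop = image[top:top + block, left:left + block]
            probs = model.predict(crop[np.newaxis, ...], verbose=0)[0]
            label = int(np.argmax(probs))
            if probs[label] >= threshold:
                # (x, y, width, height) box plus the predicted class.
                detections.append((left, top, block, block, label))
    return detections
```

Overlapping windows will fire multiple times on the same shape, so in practice you would merge nearby boxes (non-maximum suppression) to keep one detection per shape.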
Here is a paper that applies exactly the same method to fingerprint images to extract local features:
A direct fingerprint minutiae extraction approach based on convolutional neural networks