Training a model to produce DLib-style facial-landmark feature points for a hand and its landmarks - tensorflow

[I'm a noob in Machine Learning and OpenCV]
Below are the results, i.e. the 68 facial landmarks that you get when applying DLib's facial landmarks model, which can be found here.
This script mentions that the model was trained on the iBUG 300-W face landmark dataset.
Now I wish to create a similar model for mapping a hand's landmarks. I have the hand dataset here.
What I don't get is:
1. How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image or is there an optimised way for this?
2. In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow points are 22, 23, 24, 25 and 26. At what point would they have been given those values?
3. Would training those images on DLib's shape predictor training script suffice or would I have to train the model on other frameworks (like Tensorflow + Keras) too?

How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image or is there an optimised way for this?
-> Yes, you should do it all manually: detect the hand location and define how many points you need to describe the shape.
In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow points are 22, 23, 24, 25 and 26. At what point would they have been given those values?
-> They are defined before the learning step. For example, say you want 3 points for each finger and 2 more points for the wrist, so 5 × 3 + 2 = 17 points in total.
It depends on how you define which points belong to which finger, e.g. point[0] to point[2] are for the thumb, and so on.
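For illustration, one possible way to fix such a convention up front and then use it consistently during labelling (the names and index ranges below are purely hypothetical, not taken from any existing dataset):

# Hypothetical indexing convention for 17 hand landmarks
# (3 points per finger + 2 wrist points), matching the counts above.
HAND_LANDMARK_INDICES = {
    "thumb":  range(0, 3),
    "index":  range(3, 6),
    "middle": range(6, 9),
    "ring":   range(9, 12),
    "pinky":  range(12, 15),
    "wrist":  range(15, 17),
}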
Would training those images on DLib's shape predictor training script suffice or would I have to train the model on other frameworks (like Tensorflow + Keras) too?
-> With dlib alone you can do everything.

thachnb's answer above covers everything.
Labelling has to be done manually. With DLib, imglab comes with the source and can easily be built with CMake. It allows for:
Labelling objects with bounding boxes
Annotating/denoting parts (features) of an object of interest. For example: the face is the object and the landmark positions are its parts.
I was recommended Amazon Mechanical Turk a lot during my research.
The features are given those values during labelling. You can be creative with the naming convention, as long as you are consistent.
As only landmark/feature-point training is required here, DLib's pre-built shape predictor trainer would suffice. DLib has pretty in-depth documentation, so it won't be hard to follow along.
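For reference, a minimal training sketch with dlib's Python bindings could look like the following (the XML and output file names are placeholders, and the option values are simply the ones from dlib's own face-landmark training example, so they would need tuning for hands):

import dlib

options = dlib.shape_predictor_training_options()
options.oversampling_amount = 300  # augment each labelled example with random deformations
options.nu = 0.05                  # regularisation strength
options.tree_depth = 2
options.be_verbose = True

# training_hands.xml is a hypothetical imglab-style XML file listing the images,
# the hand bounding boxes and the per-hand part (landmark) annotations.
dlib.train_shape_predictor("training_hands.xml", "hand_predictor.dat", options)

# Optionally report the mean landmark error on a held-out testing XML.
print(dlib.test_shape_predictor("testing_hands.xml", "hand_predictor.dat"))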
Additionally, you might need some good hand dataset resources for this. Below are some suggestions:
CVOnline: Image Databases
TwentyBN Dataset
Silesian University of Technology's Database
Hope this works out for others.

Related

Is it possible to train an NN in Keras with features that won't be available for prediction?

I'm fairly new to this topic as a whole and struggle to wrap my head around even the basics of neural networks in general. I'm not looking for a project plan; I appreciate that you probably have better things to do.
Nonetheless, any idea or push in the right direction is appreciated.
Imagine a grey-box model of some kind (a thermal network, an electrical network, and so on) where it's desirable to predict returns based on very few features, with an underlying smart model that is trained on a much bigger dataset.
My question is whether it is possible to train a model with a set of features and define some as mandatory and others as good-to-have for the predictions.
Any tips are appreciated.
Cheers
Yes, you can train your model like that, but you must feed all the features during prediction. For example, say you have 30 mandatory features and 10 optional features, so 40 in total. You must feed all 40 features to get a prediction from your model; the input data shape must always be the same. "But they were called optional features, so why am I being forced to provide them now?" Well, I will discuss two options.
Option 1: set the input shape to None. If you set the input shape to None, your model will accept input of any shape, but then you have to handle a few things yourself. You can't use a MaxPooling layer naively: if you really need MaxPooling, you have to calculate the input and output shapes of all the layers using only the mandatory feature shape (the minimum input shape). If you calculate them using (mandatory + optional) features, you can end up with an error, because the input to a pooling layer can become too small to be reduced any further. Take care of that and you're good to go.
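A minimal sketch of how such a variable-length input could look in Keras (this is an assumption about the intended setup, and it sidesteps the MaxPooling sizing issue by using global pooling):

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(None, 1))            # the feature axis is left variable
x = layers.Conv1D(16, 3, padding="same", activation="relu")(inputs)
x = layers.GlobalMaxPooling1D()(x)                # pools over whatever length arrives
outputs = layers.Dense(1, activation="sigmoid")(x)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")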
Option 2: I will give you an example. I was using an OpenPose output dataset to classify some movements. OpenPose outputs 18 skeleton keypoints, i.e. 36 features counting the x and y coordinates. These keypoints were extracted from live camera frames, but not all body parts of a person are always inside the frame: when someone's legs are outside the frame we can't get their leg keypoints, yet we still need to classify. There were several options: we could replace the missing keypoints with 0, or find the median/mean of similar poses and use that value for the missing keypoints. We found the best choice by analysing all the data. If you go with this option, I suggest you analyse the data first and then decide how you are going to handle the missing fields.
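A small sketch of that idea, assuming the 30 mandatory + 10 optional layout from above and NaN markers for missing optional features (the function name is made up for illustration):

import numpy as np

def impute_optional(x, train_mean, strategy="mean"):
    """x: (n_samples, 40) float array with NaN where optional features are missing."""
    x = x.copy()
    missing = np.isnan(x)
    if strategy == "zero":
        x[missing] = 0.0                                   # plain zero-fill
    else:
        # fill each missing slot with that column's mean from the training set
        x[missing] = np.broadcast_to(train_mean, x.shape)[missing]
    return x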

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've also derived some additional features; for example, I rasterized the whole 3D space and calculated cell_x, cell_y and cell_z for every time step. The time series themselves have variable lengths.
My goal is to build a model which classifies every time step with the label 0 or 1 (binary classification based on past and future values). For this I have a lot of training time series where the labels are already set.
One thing that could be very problematic is that there are very few 1 labels in the data (for example, only 3 of 800 samples are labeled 1).
It would be great if someone could point me in the right direction, because there are many possible problems:
Wrong hyperparameters
Incorrect model
Too few 1 labels, but I think that's not a big problem because I only need the model to suggest the right time steps, so I would only use the peaks of the output.
Too little or poor-quality training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is binary classification, so you should use only one neuron in your output layer (try inserting an additional dense layer between the LSTM layer and the output, and try dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons unless you have a multi-label problem, but if you switch to one output neuron it is the right loss. You then also need sigmoid as the activation function.
As a last piece of advice: try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are unbalanced.
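A sketch that combines these suggestions (one sigmoid output neuron, binary crossentropy and class weights); the array shapes are dummy values just to keep the example self-contained:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras import layers, models

# dummy data: 800 series of 50 steps with 6 features, and only 3 positive labels
timesteps, n_features = 50, 6
X_train = np.random.rand(800, timesteps, n_features).astype("float32")
y_train = np.zeros(800, dtype="int32")
y_train[:3] = 1

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))           # e.g. {0: ~0.5, 1: ~133}

model = models.Sequential([
    layers.LSTM(32, input_shape=(timesteps, n_features)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),        # single output neuron
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, class_weight=class_weight)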
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits BasicLSTMCell. You can find the documentation for BasicLSTMCell here, and for building the model this documentation contains code that will help you construct a BasicLSTMCell model. Hope this helps, cheers.
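For orientation only, a bare-bones BasicLSTMCell sketch in the old TF 1.x style (these APIs are deprecated and live under tf.compat.v1 in TF 2; tf.keras.layers.LSTM is the modern equivalent):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

n_features, n_hidden = 6, 32
# inputs: [batch, time, features]; None allows variable batch and sequence length
inputs = tf.placeholder(tf.float32, [None, None, n_features])

cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# one sigmoid unit per time step for the binary per-step classification
logits = tf.layers.dense(outputs, 1)
probabilities = tf.sigmoid(logits)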

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that can evaluate a model on the LFW dataset. The process is: read a pair of images (using the LFW annotation list), track and crop the face, align it, pass it through a pre-trained FaceNet model (a .pb file, using TensorFlow) and extract the features. The feature vector size is (1, 128) and the input image is (160, 160).
To evaluate on the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) through two identical models ([2 x FaceNet]; this is equivalent to passing a batch of 2 images through a single network) and calculating the euclidean distance of the embeddings. Finally, I am training a linear SVM classifier to output 0 when the embedding distance is small and 1 otherwise, using the pair labels. This way I am trying to learn a threshold to be used during testing.
Using this architecture I get a score of at most 60%. On the other hand, using the same architecture with other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] activations (not embeddings), I get 90%. I definitely cannot replicate the scores I see online (99.x%), but with the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
Never mind, the approach is correct; the FaceNet model that is available online is poorly trained, and that is the reason for the poor score. Since this model is trained on a different dataset than the one described in the paper (obviously), the verification score will be lower than expected. However, if you set a constant threshold to the desired value you can probably increase the true positives, at the cost of F1 score.
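For concreteness, a small sketch of the threshold-learning step described in the question: L2-normalise the embeddings, compute the pairwise euclidean distances and sweep a single distance threshold over the labelled pairs (all names here are illustrative):

import numpy as np

def best_threshold(emb_a, emb_b, same_person):
    """emb_a, emb_b: (n_pairs, 128) embeddings; same_person: (n_pairs,) booleans."""
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    thresholds = np.linspace(0.0, 2.0, 400)   # distances of unit vectors lie in [0, 2]
    accs = [np.mean((dist < t) == same_person) for t in thresholds]
    return thresholds[int(np.argmax(accs))], max(accs)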
You can use a similarity search engine: either an approximate kNN search library such as Faiss or Nmslib, a cloud-ready open-source similarity search tool such as Milvus, or a production-ready managed service such as Pinecone.io.
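As a rough example of the Faiss route (assuming 128-dimensional embeddings like FaceNet's; the data here is random placeholder data):

import numpy as np
import faiss

d = 128
gallery = np.random.rand(1000, d).astype("float32")   # placeholder gallery embeddings
probe = np.random.rand(1, d).astype("float32")        # placeholder probe embedding

index = faiss.IndexFlatL2(d)          # exact L2 search; IVF/HNSW indexes scale further
index.add(gallery)
distances, neighbours = index.search(probe, 5)        # 5 nearest gallery faces
print(neighbours[0], distances[0])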

Changing a trained network to keep only a subset of its output

Suppose I have a trained TensorFlow classification network for 20 classes as in PASCAL VOC 2007: aeroplane, bicycle, ..., car, cat, ..., person, ..., tvmonitor.
Now, I would like to have a sub-network for only a subset of the classes, e.g., 3 classes: car, cat, person.
Then, I can use this network for testing or for re-training/fine-tuning on a new dataset, only for the 3 classes.
It should be possible to extract this sub-network out of the original network, since it is only the last layer that will change. We need to discard the neurons/weights for the discarded classes.
My question: Is there an easy way to do this in TensorFlow?
It will be great if you can point to some sample code or similar solution.
I have googled, but have not come across any mention of this.
The symmetric problem, expanding the number of classes without discarding the original weights, can potentially be useful for some people, but my current focus is the one above.
If you want to keep the output for only a few classes, you can simply extract the corresponding slices from the last layer's weights.
For example, let's assume the last layer is fully connected. Its weights are a tensor of size num_previous x num_output.
You want to keep only a few of these outputs, say outputs 1, 22 and 42. You can get the weights of your new fully connected layer as:
outputs_to_keep = [1, 22, 42]
new_W = tf.transpose(tf.gather(tf.transpose(old_W), outputs_to_keep))
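If the layer also has a bias term, it can be sliced in the same way, and on recent TensorFlow versions the double transpose can be avoided with the axis argument (old_b is a hypothetical name for the original bias vector):
new_W = tf.gather(old_W, outputs_to_keep, axis=1)   # same result, no transposes
new_b = tf.gather(old_b, outputs_to_keep)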
It is possible to extract a pretrained subnet as you said. It is called transfer learning. There are different ways to do it; here is one:
Find the layer you want to start from. You can use TensorBoard to locate it and then fetch it with graph.get_tensor_by_name(). Usually you keep the convolutional layers and discard the fully connected ones.
Connect your new layers (normally fully connected ones) to the previous layer.
Freeze the variables (weights) of the pretrained layers using trainable=False. Alternatively, you can instruct the optimizer to update only the weights of the new layers (see the sketch after these steps).
Train your model with the new classes.
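A short Keras sketch of steps 2-4 under the assumption of a VGG16 backbone (the backbone choice and layer setup are placeholders, not the poster's actual network): reuse the pretrained convolutional base, freeze it, and train a new 3-class head for car, cat and person.

from tensorflow.keras import layers, models, applications

base_model = applications.VGG16(include_top=False, pooling="avg",
                                input_shape=(224, 224, 3), weights="imagenet")
base_model.trainable = False                     # step 3: freeze the pretrained weights

outputs = layers.Dense(3, activation="softmax")(base_model.output)  # car, cat, person
model = models.Model(base_model.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# step 4: model.fit(new_images, new_labels, ...) on the 3-class dataset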

Pattern recognition on sphere (HEALPY based)

I am using TensorFlow and Keras. Is it possible to achieve proper pattern recognition for images on the surface of a sphere? I am using the Healpy framework to create the skymaps on which the pattern recognition should work. The problem is that these Healpy skymaps are one-dimensional numpy arrays, so a compact sub-pattern may end up scattered across this 1D array. This is actually pretty hard to learn for a basic machine learning algorithm (I am thinking of a convolutional deep network).
A specific task in this context would be counting blobs on the surface of a sphere (see attached image). For this particular task the correct number would be 8. So I created 10,000 skymaps (Healpy settings: nside=16, corresponding to npix=3072), each with a random number of blobs between 0 and 9 (thus 10 possibilities). I tried to solve this with the 1D Healpy array and a simple feed-forward network:
from keras.models import Sequential
from keras.layers import Dense, Dropout

# one dense hidden layer over the flat Healpy pixel array, softmax over the 10 possible blob counts
model = Sequential()
model.add(Dense(npix, input_dim=npix, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(10, kernel_initializer='uniform', activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(skymaps, number_of_correct_sources, batch_size=100, epochs=10, validation_split=1. - train)
However, after training with 10,000 skymaps, the test set yielded an accuracy of only 38%. I guess this would increase significantly if I provided the real arrangement of the Healpy cells (as they appear on the sphere) instead of only the 1D array. In that case one could use a convolutional network (Convolution2D) and proceed as in ordinary image recognition. Any ideas on how to map the Healpy cells properly into a 2D array, or on using a convolutional network directly on the sphere?
Thanks!
This is a hard way of tackling a relatively simple problem that is unashamedly 2-D!
If the objects you are looking for are as prominent as those in your figure, create the 2-D map for the data and then threshold it at a series of threshold levels: the highest thresholds pick out the brightest objects. Any continuous projection like Aitoff or Hammer will do, and to eliminate the edge problems, use rotations of the projection. Segmented projections, like Healpix, are good for data storage, but not necessarily ideal for data analysis.
If the map has a poor signal-to-noise ratio, so that you are looking for objects in the murk of the noise, then some sophistication is required, maybe even some neural net algorithm. However, you might take a look at the Planck data analysis of Sunyaev-Zeldovich galaxy clusters, the earliest of which is perhaps https://arxiv.org/abs/1101.2024 (Paper VIII). The subsequent papers refine and add to this.
(This should have been a comment but I lack the rep.)
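A rough sketch of the threshold-and-count idea (healpy's cartview can hand back the projected 2-D array; the threshold value and the map below are placeholders, and note that the call also draws the map):

import healpy as hp
import numpy as np
from scipy import ndimage

nside = 16
skymap = np.random.rand(hp.nside2npix(nside))    # placeholder 1-D Healpy map

# project to a 2-D grid; repeat with different rot= values to handle blobs cut by the edges
projected = hp.cartview(skymap, return_projected_map=True)
binary = np.asarray(projected) > 0.9             # threshold picking out the bright blobs
labels, n_blobs = ndimage.label(binary)          # count connected components
print("blobs found:", n_blobs)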