MediaPipe: modifying the face landmark subgraph to use a custom landmark model

I want to modify the FaceMesh example and plug in a different face landmark model. While exploring the face_landmark module, I found some nodes in the face_landmark_gpu subgraph that confuse me.
As far as I know, face_landmark_gpu is responsible for taking the image and the face ROIs (obtained from the face detection model) and producing landmark points for them.
In this subgraph, there is a node that splits the output vector of the inference calculator into two vectors: the first for the landmark tensor and the second for the face_flag_tensor. The next node converts the face_flag tensor into a float that represents the confidence. What does confidence mean here? I thought the face detection model, not the landmark model, was responsible for face presence.
My model outputs only a tensor of landmark points, with no additional output for a confidence or face presence score.
How can I use my model instead of the default MediaPipe landmark model?
Any help is appreciated.

Related

Solving imbalanced classification on a video transcript dataset

I am currently working on a problem that requires segmenting a video lecture transcript based on the topics present within the video. My dataset consists of sentence-wise labels, where 1 indicates the beginning of a new segment (i.e. topic) and 0 indicates a continuation of the current segment. The problem can therefore be framed as binary classification: the model takes a sentence as input and makes a binary prediction for it. However, due to the very nature of the problem, the dataset is highly imbalanced (90% 0s and 10% 1s). As a consequence, while training the model I have noticed that it becomes biased and starts predicting all 0s.
I have tried resolving this issue by using class_weights in model.fit(), but this hasn't been of much help. If I increase the penalty on the 1s class, my model starts predicting all 1s; if I lower it, the model again starts predicting all 0s. Does anyone have ideas on how I should resolve this issue?
There are other oversampling and undersampling techniques (e.g. SMOTE), but I don't think they are suitable for my use case, since they would disrupt the continuity of the video transcript.
PS: I am sharing a screenshot of my model's architecture for reference.
Basically, the model takes BERT-tokenized input sentences and encodes them using the Universal Sentence Encoder. This encoding is then passed to a classification layer, which returns a tensor of shape [BATCH_SIZE, 1]. I am using BinaryCrossentropy as the loss function.
[Screenshot: model architecture]
[Screenshot: fitting the model using model.fit()]
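For concreteness, here is a minimal sketch of the class-weighting setup described above, using a stand-in Keras model; the layer sizes, weight values and variable names are placeholders, not the actual architecture from the screenshot:
import tensorflow as tf

# Stand-in for the sentence-encoder + classification head described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),             # e.g. a 512-d sentence embedding
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output shape: [BATCH_SIZE, 1]
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

# class_weight re-weights the loss per class; with ~90% 0s and ~10% 1s,
# weights inversely proportional to class frequency are a common starting point.
class_weight = {0: 1.0, 1: 9.0}
# model.fit(x_train, y_train, epochs=10, class_weight=class_weight)  # x_train/y_train assumed
Tracking precision and recall (rather than accuracy alone) makes it easier to see whether a given weighting is pushing the model toward all-0 or all-1 predictions.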

Can someone give me an explanation of the MultiBox loss function?

I have found the following expression for the SSD MultiBox loss function:
multibox_loss = confidence_loss + alpha * location_loss
Can someone explain what those terms mean?
SSD MultiBox (short for Single Shot MultiBox Detector) is a neural network that can detect and locate objects in an image in a single forward pass. The network is trained in a supervised manner on a dataset of images where a bounding box and a class label are given for each object of interest. The loss term
multibox_loss = confidence_loss + alpha * location_loss
is made up of two parts:
Confidence loss is a categorical cross-entropy loss for classifying the detected objects. The purpose of this term is to make sure that the correct label is assigned to each detected object.
Location loss is a regression loss (smooth L1 in the SSD paper) on the parameters of the detected bounding box (its center offsets, width and height). The purpose of this term is to make sure that the correct region of the image is identified for each detected object. The alpha term is a hyperparameter used to scale the location loss.
The precise formulation of the loss is given in Equation 1 of the SSD: Single Shot MultiBox Detector paper.
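For reference, that equation can be written (keeping the paper's notation) roughly as:
L(x, c, l, g) = \frac{1}{N} \left( L_{\text{conf}}(x, c) + \alpha \, L_{\text{loc}}(x, l, g) \right)
where N is the number of matched default boxes (the loss is set to 0 when N = 0), L_conf is the confidence loss and L_loc is the localization loss described above.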

How to generate different samples using PixelCNN?

I am trying PixelCNN, which is an auto-regressive generative model. After training, the model receives an all-zero tensor and generates pixels one by one, starting from the top-left corner. Since the model parameters are fixed, can the model only produce the same output when starting from the same zero tensor? How do I produce different samples?
Yes, you always start from an all-zero tensor. However, PixelCNN outputs a distribution over values for each pixel location, so during the forward pass you sample from that predicted distribution rather than taking a fixed value. That is why the pixel values differ on every run.
This is because PixelCNN is a probabilistic neural network: each pixel is represented by a conditional probability distribution given the previously generated pixels, not by a point estimate.
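For illustration, here is a minimal sketch of that sampling loop, assuming a trained model pixel_cnn that maps an image batch to per-pixel logits over 256 intensity values (the model name, shapes and output layout are assumptions):
import numpy as np
import tensorflow as tf

H, W, C = 28, 28, 1                                   # illustrative image shape
samples = np.zeros((4, H, W, C), dtype=np.float32)    # always start from all zeros

for i in range(H):
    for j in range(W):
        for c in range(C):
            # pixel_cnn: trained PixelCNN returning logits of shape (batch, H, W, C, 256) -- assumed
            logits = pixel_cnn(samples)
            # Sample pixel (i, j, c) from its predicted categorical distribution;
            # a different random draw here gives a different image on every run.
            pixel = tf.random.categorical(logits[:, i, j, c, :], num_samples=1)
            samples[:, i, j, c] = tf.cast(pixel[:, 0], tf.float32).numpy() / 255.0
Because each pixel is drawn from its conditional distribution rather than taken as an argmax, repeated runs from the same all-zero tensor produce different samples.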

Training a model for hand landmarks, similar to DLib's facial landmark feature points

[I'm a noob in Machine Learning and OpenCV]
Below are the results, i.e. the 68 facial landmarks you get when applying DLib's facial landmarks model, which can be found here.
This script mentions that the model was trained on the iBUG 300-W face landmark dataset.
Now I wish to create a similar model for mapping hand landmarks. I have the hand dataset here.
What I don't get is:
1. How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image, or is there an optimised way to do this?
2. In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow is points 22, 23, 24, 25 and 26. At what point would they have been given those values?
3. Would training those images with DLib's shape predictor training script suffice, or would I have to train the model on other frameworks (like TensorFlow + Keras) too?
How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image, or is there an optimised way to do this?
-> Yes, you do it all manually: detect the hand location and decide how many points you need to describe the shape.
In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow is points 22, 23, 24, 25 and 26. At what point would they have been given those values?
-> They are assigned before the learning step. For example, if you want 3 points for each finger and 2 more for the wrist, that is 15 + 2 = 17 points in total.
It depends on how you define which points belong to which finger, e.g. point[0] to point[2] are for the thumb, and so on.
Would training those images with DLib's shape predictor training script suffice, or would I have to train the model on other frameworks (like TensorFlow + Keras) too?
-> With dlib you can do everything.
@thachnb's answer above covers everything.
Labelling has to be done manually. With DLib, Imglab comes with the source and can be easily built with CMake. It allows for:
Labelling with boxes
Annotating/denoting parts (features) of an object (the object of interest), e.g. the face is the object and the landmark positions are its parts.
I was recommended Amazon Mechanical Turk a lot during my research.
The features are given those values during labelling. You can be creative with the naming conventions as long as you are consistent.
As only landmark/feature-point training is required here, DLib's pre-built shape predictor trainer would suffice. DLib has pretty in-depth documentation, so it won't be hard to follow along.
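A minimal sketch of what that training step can look like with dlib's Python API, assuming the annotations were exported from imglab as XML; the file names and option values below are placeholders:
import dlib

# Training options for dlib's shape predictor; the values here are illustrative,
# tune them for your hand dataset.
options = dlib.shape_predictor_training_options()
options.oversampling_amount = 300   # helps when the dataset is small
options.nu = 0.05
options.tree_depth = 4
options.be_verbose = True

# "hands_train.xml" is a hypothetical imglab-style XML file listing images,
# hand bounding boxes and the landmark parts you annotated.
dlib.train_shape_predictor("hands_train.xml", "hand_predictor.dat", options)

# Report the mean landmark error on a held-out set (also imglab XML).
print(dlib.test_shape_predictor("hands_test.xml", "hand_predictor.dat"))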
Additionally, you might need some good hand dataset resources. Below are some suggestions:
CVOnline: Image Databases
TwentyBN Dataset
Silesian University of Technology's Database
Hope this works out for others.

How does SSD object detection calculate its class scores and bounding-box locations?

From the paper, I understand that SSD tries to predict object locations and their class scores from different feature maps.
So each layer can make different predictions according to the number of anchor (reference) boxes at different scales.
If one convolutional feature map has 5 reference boxes, there should be class scores and box coordinates for each reference box.
We make these predictions by sliding a window (a kernel, e.g. 3x3) over the feature maps of different layers. What is not clear to me is the connection from the sliding window at a position to the score layer:
1. Is the convolution window output just connected to the score layer in a fully connected way?
2. Or do we apply some other operation to the convolution window output before connecting it to the score layer?
The class score and box predictions are obtained by convolution. This is a key difference between YOLO and SSD: SSD does not use fully connected layers. I will explain how the scores are computed.
Consider an 8x8 spatial feature map in an SSD feature extractor (as shown in the figure above). For each position in the feature map we predict the following:
4 box coordinates w.r.t. the default boxes (shown in dotted lines)
class scores for each default box (c classes)
So if we have k default (anchor) boxes, we predict (4+c)*k values per location.
Now the tricky part: how do we get those scores?
Here we use a set of convolutional kernels (normally 3x3) whose depth matches the depth of the feature map.
Since there are (4+c) predictions per anchor box, we have (4+c) such kernels, each spanning the full depth of the feature map. So it is really a set of filters.
This set of filters predicts the (4+c) scalars above.
So for a single feature map, if there are k anchor boxes referenced in the prediction,
we have k*(4+c) filters (3x3 spatially) applied around each location of the feature map in a sliding-window (i.e. convolutional) manner.
We train those filter values!
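A minimal sketch of such a convolutional prediction head in Keras, assuming k default boxes and c classes per feature-map cell; the feature-map shape and numbers below are illustrative:
import tensorflow as tf

k, c = 6, 21                                        # e.g. 6 default boxes, 21 classes
feature_map = tf.keras.Input(shape=(8, 8, 512))     # an 8x8x512 feature map, assumed

# One 3x3 convolution predicts k*4 box offsets per cell ...
loc_head = tf.keras.layers.Conv2D(k * 4, kernel_size=3, padding="same")(feature_map)
# ... and another predicts k*c class scores per cell,
# i.e. k*(4+c) filters in total, slid over every location.
conf_head = tf.keras.layers.Conv2D(k * c, kernel_size=3, padding="same")(feature_map)

# Reshape so each default box gets its own row of 4 offsets / c scores.
loc = tf.keras.layers.Reshape((8 * 8 * k, 4))(loc_head)
conf = tf.keras.layers.Reshape((8 * 8 * k, c))(conf_head)

model = tf.keras.Model(feature_map, [loc, conf])
model.summary()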