I am using the TensorFlow.js BodyPix model to get the keypoints of a human body. How can I get the distance between two keypoints?
Thank you
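One way (not specific to BodyPix) is the plain Euclidean distance between the two keypoint positions. A minimal sketch, assuming each keypoint exposes an x/y pixel position as BodyPix keypoints do; the keypoint names and coordinates below are made up for illustration, and the same arithmetic applies in JavaScript:

import math

# Hypothetical keypoint positions; BodyPix keypoints carry an x/y position in pixels.
left_shoulder = {"x": 120.0, "y": 85.0}
right_shoulder = {"x": 190.0, "y": 88.0}

def keypoint_distance(a, b):
    """Euclidean distance between two keypoint positions."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])

print(keypoint_distance(left_shoulder, right_shoulder))  # distance in pixels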
I am using the TensorFlow website's tutorial for time series modeling on my own data. The models are pretty good, but my target value is always positive, and sometimes the model predicts a negative value. Is there a way to clip the model predictions to a range?
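A minimal sketch of two options, assuming a Keras regression model like the ones in that tutorial (the architecture and data below are placeholders, not the tutorial's):

import numpy as np
import tensorflow as tf

# Option 1: make the model itself output non-negative values by ending with a
# ReLU activation (placeholder architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),  # forces predictions >= 0
])

# Option 2: keep the model unchanged and clip its predictions afterwards.
x = np.random.randn(4, 8).astype("float32")   # dummy inputs
preds = model(x).numpy()
clipped = np.clip(preds, 0.0, None)           # clip to [0, +inf)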
I want to modify the facemesh example and implement another face_landmark model. I've explored the face_landmark module; in the face_landmark_gpu subgraph there are some nodes that confused me.
As far as I know, face_landmark_gpu is responsible for taking the image and the face ROIs (obtained from the face detection model) and putting landmark points on them.
In this subgraph, there is a node for splitting the output vector of the inference calculator into 2 vectors: the first for the landmark tensor and the second for face_flag_tensor. The next node converts the face_flag tensor into a float that represents the confidence. What does confidence mean here? I believe face detection is responsible for face presence, not the landmark model.
My model gives a tensor of landmark points as output and no other output for a confidence or face presence score.
How can I use my model instead of the default MediaPipe landmark model?
Any help is appreciated.
I am using https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 to extract image feature vectors. However, I'm confused about how to preprocess the images before passing them through the module.
Based on the related GitHub explanation, the following should be done:
import tensorflow as tf  # TF 1.x-style API, as in the original snippet

image_path = "path/to/the/jpg/image"
image_string = tf.read_file(image_path)
image = tf.image.decode_jpeg(image_string, channels=3)
# Converts to float32 with values scaled into [0, 1].
image = tf.image.convert_image_dtype(image, tf.float32)
# All other transformations (during training), in my case:
image = tf.random_crop(image, [224, 224, 3])
image = tf.image.random_flip_left_right(image)
# During testing:
image = tf.image.resize_image_with_crop_or_pad(image, 224, 224)
However, using the aforementioned transformations, the results I am getting suggest that something might be wrong. Moreover, the ResNet paper says that the images should be preprocessed as follows:
A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted...
which I can't quite understand. Can someone point me in the right direction?
Looking forward to your answers!
The image modules on TensorFlow Hub all expect pixel values in range [0,1], like you get in your code snippet above. This makes it easy and safe to switch between modules.
Inside the module, the input values are scaled to the range that the network was trained for. The module https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3 has been published from a TF-Slim checkpoint (see its documentation), which uses yet another convention for normalizing inputs than He et al. -- but all this is taken care of.
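For example, a minimal sketch of using the module (TF 1.x style, matching that module version); the placeholder and dummy input below are just for illustration:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Feed the module a batch of 224x224 RGB images with values already in [0, 1];
# the module handles any further input scaling internally.
module = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/3")
images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])  # values in [0, 1]
features = module(images)  # a 2048-dimensional feature vector per image

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    dummy = np.random.uniform(0.0, 1.0, size=(1, 224, 224, 3)).astype("float32")
    print(sess.run(features, feed_dict={images: dummy}).shape)  # (1, 2048)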
To demystify the language in He et al.: it refers to the mean R, G and B values aggregated over all pixels of the dataset they studied, following the old wisdom that normalizing inputs to zero mean helps neural networks train better. However, later papers on image classification no longer expended this degree of attention to dataset-specific preprocessing.
The quotation from the ResNet paper that you mentioned is based on the following explanation from the AlexNet paper:
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256×256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.
So in the ResNet paper, a similar process consists of taking a 224×224-pixel crop of the image (or of its horizontally flipped version) to ensure the network is given constant-sized images, and then centering it by subtracting the mean.
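A rough sketch of that preprocessing, purely for illustration (it is not needed when feeding the TF Hub module above, which handles its own input scaling); the mean values below are the commonly used per-channel ImageNet means on a [0, 255] scale, standing in for the per-pixel mean the papers describe:

import tensorflow as tf

MEAN_RGB = tf.constant([123.68, 116.78, 103.94])  # stand-in for the training-set mean

def resnet_paper_preprocess(image):
    """image: float32 tensor of shape [H, W, 3] with values in [0, 255]."""
    image = tf.image.random_flip_left_right(image)   # the image or its horizontal flip
    image = tf.random_crop(image, [224, 224, 3])     # random 224x224 crop
    return image - MEAN_RGB                          # subtract the mean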
I am trying to create a script that is able to evaluate a model on the LFW dataset. As a process, I am reading pairs of images (using the LFW annotation list), tracking and cropping the face, aligning it and passing it through a pre-trained facenet model (.pb using TensorFlow) to extract the features. The feature vector size is (1, 128) and the input image is (160, 160).
To evaluate on the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) through two identical models ([2 x facenet]; this is equivalent to passing a batch of 2 images through a single network) and calculating the Euclidean distance of the embeddings. Finally, I am training a linear SVM classifier to output 0 when the embedding distance is small and 1 otherwise, using the pair labels. This way I am trying to learn a threshold to be used while testing.
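For reference, a minimal sketch of this verification step; the embeddings and labels below are random placeholders standing in for the real LFW pairs:

import numpy as np
from sklearn.svm import LinearSVC

n_pairs = 100
emb_a = np.random.randn(n_pairs, 128).astype("float32")  # embeddings of the first image of each pair
emb_b = np.random.randn(n_pairs, 128).astype("float32")  # embeddings of the second image of each pair
labels = np.random.randint(0, 2, size=n_pairs)           # 0 = same person, 1 = different person

# Euclidean distance between the two embeddings of each pair.
distances = np.linalg.norm(emb_a - emb_b, axis=1, keepdims=True)

# Learn a threshold on the distance with a linear SVM.
clf = LinearSVC()
clf.fit(distances, labels)
predictions = clf.predict(distances)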
Using this architecture I am getting a score of 60% at most. On the other hand, using the same architecture on other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] (not embeddings), I am getting 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
Never mind, the approach is correct; the facenet model that is available online is poorly trained, and that is the reason for the poor score. Since this model is trained on another dataset and not the original one described in the paper (obviously), the verification score will be lower than expected. However, if you set a constant threshold to the desired value, you can probably increase true positives, but at the cost of F1 score.
You can use a similarity search engine: either an approximate kNN search library such as Faiss or Nmslib, a cloud-ready open-source similarity search tool such as Milvus, or a production-ready managed service such as Pinecone.io.
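As an illustration, a minimal Faiss sketch for nearest-neighbour search over 128-dimensional embeddings like the facenet ones above; the vectors here are random placeholders:

import numpy as np
import faiss  # pip install faiss-cpu

d = 128
database = np.random.randn(1000, d).astype("float32")  # gallery embeddings
queries = np.random.randn(5, d).astype("float32")      # probe embeddings

index = faiss.IndexFlatL2(d)   # exact L2 search; swap in an IVF/HNSW index for approximate search
index.add(database)
distances, neighbours = index.search(queries, 5)       # top-5 nearest neighbours per query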
As I understand from the paper, SSD tries to predict object locations and their class scores from different feature maps.
So for each layer there can be different predictions with respect to the number of anchor (reference) boxes at different scales.
So if one convolutional feature map has 5 reference boxes, there should be class scores and bounding-box coordinates for each of the reference boxes.
We make the above predictions by sliding a window (a kernel, e.g. 3×3) over the feature maps of different layers. What I am not clear about is the connection from the sliding window at a position to the score layer:
1. Is it just a connection of the convolution window output to the score layer in a fully connected way?
2. Or do we do some other operation on the convolution window output before connecting it to the score layer?
The class score and bounding-box predictions are obtained by convolution. That is a difference between YOLO and SSD: SSD doesn't go the fully connected way. I will explain how the scores are obtained.
Above is an 8×8 spatial feature map in an SSD feature extractor model. For each position in the feature map we predict the following:
4 bounding-box coordinates w.r.t. the default boxes (shown in dotted lines)
class scores for each default box (c classes)
Let's say we have k default (anchor) boxes; then we predict k × (4 + c) values per position.
Now the tricky part: how do we get those scores?
Here we use a set of convolutional kernels whose depth equals the depth of the feature map (spatially, normally 3×3).
Since there are (4 + c) predictions w.r.t. a single anchor box, it's like we have (4 + c) of the above-mentioned kernels, each with the depth of the feature map. So it's more like a set of filters.
This set of filters predicts the above (4 + c) scalars.
So for a single feature map, if there are k anchor boxes that we reference in the prediction,
we have **k × (4 + c) filters (3×3 in spatial size) applied around each location of the feature map in a sliding-window manner.**
We train those filter values!
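For concreteness, a minimal sketch of such a prediction head (my own illustration with assumed values of k and c, not the reference SSD code):

import tensorflow as tf

num_anchors = 6    # k: default boxes per location (placeholder value)
num_classes = 21   # c: number of classes, e.g. 20 + background (placeholder value)

feature_map = tf.keras.Input(shape=(8, 8, 512))  # e.g. an 8x8 feature map with 512 channels

# Box offsets: k * 4 values per location, from one 3x3 convolution.
box_head = tf.keras.layers.Conv2D(num_anchors * 4, kernel_size=3, padding="same")
# Class scores: k * c values per location, from another 3x3 convolution.
cls_head = tf.keras.layers.Conv2D(num_anchors * num_classes, kernel_size=3, padding="same")

boxes = box_head(feature_map)    # shape (batch, 8, 8, k * 4)
scores = cls_head(feature_map)   # shape (batch, 8, 8, k * c)
model = tf.keras.Model(feature_map, [boxes, scores])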