How does SSD object detection calculate its class scores and bbox locations? - tensorflow

As described in the paper, SSD predicts object locations and the corresponding class scores from several different feature maps.
So each layer can produce different predictions, depending on the number of anchor (reference) boxes assigned to it at different scales.
If one convolutional feature map has 5 reference boxes, there should be class scores and bbox coordinates for each of those reference boxes.
These predictions are made by sliding a window (kernel, e.g. 3×3) over the feature maps of the different layers. What I am not clear about is the connection from the sliding window at a given position to the score layer.
1. Is the convolution window output just connected to the score layer in a fully connected way?
2. Or do we perform some other operation on the convolution window output before connecting it to the score layer?

The class scores and bbox predictions are obtained by convolution. That is the difference between YOLO and SSD: SSD does not use fully connected layers for this. I will explain how the score function is computed.
Consider an 8×8 spatial feature map from an SSD feature extractor (the image shown above). For each position in the feature map we predict the following:
4 bbox coordinates w.r.t. the default boxes (shown in dotted lines)
class scores for each default box (c classes)
So if we have k default (anchor) boxes, we predict k(4+c) values per location.
Now the tricky part: how do we get those scores?
We use a set of convolutional kernels whose depth matches the depth of the feature map (normally 3×3 spatially).
Since there are (4+c) predictions w.r.t. a single anchor box, it is as if we have (4+c) of the above-mentioned kernels, each with the depth of the feature map. So it is really a set of filters.
This set of filters predicts the (4+c) scalars above.
So, for a single feature map, if there are k anchor boxes that we reference in the predictions,
we have **k(4+c) filters (3×3 spatially) applied around each location of the feature map in a sliding-window manner.**
We train those filter values!
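As a minimal sketch of this idea (assuming a TF2/Keras setup; the function name, feature-map size, and anchor/class counts below are hypothetical), the prediction head for one feature map is just a single 3×3 convolution with k(4+c) output channels:

```python
import tensorflow as tf

def ssd_head(feature_map, num_anchors, num_classes):
    # One 3x3 conv predicts k*(4+c) values at every spatial location.
    preds = tf.keras.layers.Conv2D(
        filters=num_anchors * (4 + num_classes),
        kernel_size=3, padding='same')(feature_map)
    # Split the channel dimension into per-anchor box offsets and class scores.
    h, w = feature_map.shape[1], feature_map.shape[2]
    preds = tf.reshape(preds, (-1, h, w, num_anchors, 4 + num_classes))
    box_offsets = preds[..., :4]    # 4 bbox offsets per anchor
    class_scores = preds[..., 4:]   # c class scores per anchor
    return box_offsets, class_scores

# Example: an 8x8 feature map with 512 channels, 6 anchors, 21 classes.
fmap = tf.random.normal((1, 8, 8, 512))
boxes, scores = ssd_head(fmap, num_anchors=6, num_classes=21)
print(boxes.shape, scores.shape)  # (1, 8, 8, 6, 4) (1, 8, 8, 6, 21)
```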

Visualizing the 2nd convolutional layer

Via tf.summary I visualized the first convolutional layer in my model, of shape [5,5,3,32], as a set of individual images, one per filter. So this layer has filters of 5x5 spatial size and depth 3, and there are 32 of them. I'm viewing these filters as 5x5 color (RGB) images.
I'm wondering how to generalize this to the second convolutional layer, and the third, and so on...
The shape of the second convolutional layer is [5,5,32,64].
My question is: how would I transform that tensor into individual 5x5x3 images?
With the first conv layer of shape [5,5,3,32] I visualize it by first transposing, tf.transpose(W_conv1,(3,0,1,2)), and then I have 32 5x5x3 images.
Doing tf.transpose(W_conv2,(3,0,1,2)) would produce a shape of [64,5,5,32]. How would I then use those "32 color channels"? (I know it's not that simple :) ).
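For reference, a minimal sketch of that first-layer visualization (assuming TF2 eager mode; W_conv1 and the log directory here are stand-ins):

```python
import tensorflow as tf

# W_conv1 has shape [5, 5, 3, 32]: 5x5 spatial, 3 input channels, 32 filters.
W_conv1 = tf.Variable(tf.random.normal([5, 5, 3, 32]))  # stand-in weights

# Move the filter dimension first so each filter becomes one 5x5x3 RGB image.
filters_as_images = tf.transpose(W_conv1, (3, 0, 1, 2))  # shape [32, 5, 5, 3]

# Scale to [0, 1] so the filters display sensibly as images.
f_min = tf.reduce_min(filters_as_images)
f_max = tf.reduce_max(filters_as_images)
filters_as_images = (filters_as_images - f_min) / (f_max - f_min)

# Log them as a batch of images for TensorBoard.
writer = tf.summary.create_file_writer('logs')
with writer.as_default():
    tf.summary.image('conv1_filters', filters_as_images, step=0, max_outputs=32)
```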
Visualization of higher-level filters is usually done indirectly. To visualize a particular filter, you look for the images that the filter would respond to the most. For that you perform gradient ascent in the space of images (instead of changing the parameters of the network, as you do when training, you change the input image).
You will understand it more easily if you play with the following Keras code: https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py
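A minimal sketch of that idea (gradient ascent in image space; it assumes an already-built Keras model, and the layer name, image size, and step settings are hypothetical):

```python
import tensorflow as tf

def visualize_filter(model, layer_name, filter_index, steps=100, step_size=1.0):
    """Find an input image that maximally activates one filter by
    gradient ascent on the image (the network weights stay fixed)."""
    layer = model.get_layer(layer_name)
    extractor = tf.keras.Model(inputs=model.inputs, outputs=layer.output)

    # Start from a small random image.
    img = tf.Variable(tf.random.uniform((1, 128, 128, 3)) * 0.1 + 0.5)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = extractor(img)
            # Mean activation of the chosen filter is the objective.
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        grads = tf.math.l2_normalize(grads)
        img.assign_add(step_size * grads)   # ascend, not descend

    return img[0].numpy()
```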

How to train a classifier with multi-dimensional feature values as input

I am trying to build a classifier whose input contains multi-dimensional features. Does anyone know of a dataset that contains multi-dimensional features?
For example: in the MNIST data we have the pixel location as the feature, and the feature value is a single-dimensional grayscale value ranging from 0 to 255. But if we consider a colour image, a single grayscale value is not sufficient; we still take the pixel location as the feature, but the feature value is 3-dimensional (R (0-255) as one dimension, G (0-255) as the second and B (0-255) as the third). How can one solve this with a feed-forward neural network?
Small suggestions are also accepted.
The same way.
If you plug the pixels into your network directly, just reshape the tensor to have length H*W*3.
If you use convolutions, note that the last parameters of the kernel shape are the number of input/output channels. Just make sure the first convolution uses 3 as its number of input channels.
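A minimal sketch of both options (assuming a Keras setup; the 32×32 image size, layer widths, and class count are hypothetical):

```python
import tensorflow as tf

H, W = 32, 32        # hypothetical image size
num_classes = 10

# Option 1: flatten the H x W x 3 image into a vector for a feed-forward net.
dense_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(H, W, 3)),   # vector of length H*W*3
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])

# Option 2: keep the spatial structure; the first conv simply takes 3 input channels.
conv_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(H, W, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
```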

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that evaluates a model on the LFW dataset. As a process, I read pairs of images (using the LFW annotation list), track and crop the face, align it, pass it through a pre-trained FaceNet model (.pb, using TensorFlow) and extract the features. The feature vector size is (1,128) and the input image is (160,160).
To evaluate on the verification task, I use a Siamese architecture. That is, I pass a pair of images (same or different person) through two identical models ([2 x FaceNet]; this is equivalent to passing a batch of two images through a single network) and compute the euclidean distance of the embeddings. Finally, I train a linear SVM classifier to output 0 when the embedding distance is small and 1 otherwise, using the pair labels. This way I am trying to learn a threshold to be used at test time.
Using this architecture I get a score of 60% at most. On the other hand, using the same architecture on other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] (not embeddings), I get 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
Never mind, the approach is correct; the FaceNet model that is available online is poorly trained and that is the reason for the poor score. Since this model is trained on a different dataset and not the original one described in the paper (obviously), the verification score will be lower than expected. However, if you set a constant threshold to the desired value you can probably increase true positives, but at the expense of the F1 score.
You can also use a similarity search engine: either approximate kNN search libraries such as Faiss or Nmslib, cloud-ready open-source similarity search tools such as Milvus, or a production-ready managed service such as Pinecone.io.
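As a minimal sketch of the threshold-based verification step described above (it assumes the 128-d embeddings are already extracted; the function name and threshold value are hypothetical and would be tuned on a validation fold):

```python
import numpy as np

def verify_pairs(emb_a, emb_b, labels, threshold=1.1):
    """Threshold-based verification on FaceNet embeddings.

    emb_a, emb_b: (N, 128) arrays, one row per image of each pair
    labels:       0 for 'same person' (small distance), 1 for 'different person'
    threshold:    hypothetical value; sweep it on held-out pairs
    """
    dists = np.linalg.norm(emb_a - emb_b, axis=1)   # euclidean distance per pair
    preds = (dists >= threshold).astype(int)        # 1 = predicted 'different person'
    accuracy = (preds == labels).mean()
    return dists, accuracy
```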

Are there any pros to having a convolution layer using a filter the same size as the input data?

Are there any pros to having a convolution layer using a filter the same size as the input data (i.e. the filter can only fit over the input one way)?
A filter the same size as the input data will collapse the output dimensions to 1 x 1 x n_filters, which could be useful towards the end of a network that has a low-dimensional output, like a single number, for example.
One place this is used is in sliding-window object detection, where redundant computation is saved by making only one forward pass to compute the outputs on all windows.
However, it is more typical to add one or more dense layers that produce the desired output dimension, instead of fully collapsing your data with convolution layers.
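A minimal sketch of that collapse (assuming Keras; the 7×7×256 input and 10 filters are hypothetical):

```python
import tensorflow as tf

# Hypothetical feature map entering the final layer: 7x7 spatial, 256 channels.
inputs = tf.keras.Input(shape=(7, 7, 256))

# A convolution whose kernel is the same size as its input ("valid" padding)
# fits in exactly one position, collapsing the output to 1 x 1 x n_filters.
x = tf.keras.layers.Conv2D(filters=10, kernel_size=7, padding='valid')(inputs)
print(x.shape)  # (None, 1, 1, 10)

# This behaves like Flatten + Dense(10), but stays convolutional, which is
# what lets sliding-window detectors reuse computation on larger inputs.
```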

Detecting text in natural images

I wrote code in TensorFlow using a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to 128 for both height and width.
I used 9 conv layers with zero padding and three max_pool layers with a window size of 2×2 and stride 2. Since I use just three pooling layers, the last layer's shape will be 16×16. The last conv layer has 256 filters.
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
My questions are:
Is this architecture enough for the detection task? I know there is something called NMS for detection. Also, what are the labels in this case?
In general (and this is not a rule, just based on my experience) you should start with a smaller net, 2 or 3 conv layers, and see what happens. If you get good results, focus on the winning topology and adapt the hyperparameters (learning rate, batch size and so on); if you don't get good results at all, go deeper, i.e. add conv layers, and evaluate again. 12 conv layers is really big; your problem's complexity should be equally big, otherwise you will reach good accuracy but waste a lot of compute power and time for nothing. And by the way, use a pyramid form, i.e. start wide and finish small.
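As a minimal sketch of that suggestion (a small net to start from, wider early and narrower later; all sizes, the single-box regression output and the optimizer are hypothetical choices, not the asker's setup):

```python
import tensorflow as tf

# A small starting point: 3 conv layers, narrowing as we go deeper,
# to be grown only if it clearly underfits.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same',
                           input_shape=(128, 128, 3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(16, 3, activation='relu', padding='same'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation='sigmoid'),  # e.g. one box (x, y, w, h)
])
model.compile(optimizer='adam', loss='mse')
model.summary()
```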