What is meant by visualizing an embedding space(neural network)? - tensorflow

I was reading about an activity recognition paper https://arxiv.org/pdf/1705.07750.pdf. Here, they use 3D convolution on inception v1 to perform activity recognition. I was listening to a talk that said visualizing embedding space of the features from the video.
1) What does it mean to visualize an embedding space? Are you looking at the filters that it has learnt or are you looking for clusterings of similar activities?
2) Do you just visualize the weight matrix for seeing the features that it is capturing? If yes, which weight matrix?
3)Does tf.summary.image() help in visualizing the weight matrix?

The embedding space is the space of the features produced by some learning algorithm. In the specific case of a (convolutional) neural network, this usually means one of the output feature maps (flattened) at some predefined layer or the output of one of the fully connected layers.
What one would visualize is not the weight matrix, but the values of the produced features for some input test data. For example one takes the full test set and passes it through the network and computes the features for each image at a specific layer, and then visualizes those values.
TensorBoard has functionality to automatically visualize embeddings and other feature spaces, you should take a look at it.
Note that in some application contexts like NLP an embedding has a slightly different definition but the use is the same.

Related

General usefulness of Dense layers for different identification tasks

I'd like to ask, is it practical to use embeddings and similarity metrics to any form of identification task? If I had a neural network trained to find different objects in a photo, would extracting the fully-connected layers/Dense layers and clustering them be useful?
I've recently found that there is an embeddings projector tool from tensorflow that is very cool and useful. I know that there has been some work in word embeddings and how similar words cluster together. This is the case for faces as well.
Having said that, I want to follow the same methods into analyzing geological sites; can I train a model to create embeddings of the features of a site and use clustering methods to classify?
Yes, we can do that. We can use embeddings for images and visualize the embeddings in the tensorboard.
You can replicate using the fashion mnist embedding example found here for your use case.

Where are the filter image data in this TensorFlow example?

I'm trying to consume this tutorial by Google to use TensorFlow Estimator to train and recognise images: https://www.tensorflow.org/tutorials/estimators/cnn
The data I can see in the tutorial are: train_data, train_labels, eval_data, eval_labels:
((train_data,train_labels),(eval_data,eval_labels)) =
tf.keras.datasets.mnist.load_data();
In the convolutional layers, there should be feature filter image data to multiply with the input image data? But I don't see them in the code.
As from this guide, the input image data matmul with filter image data to check for low-level features (curves, edges, etc.), so there should be filter image data too (the right matrix in the image below)?: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks
The filters are the weight matrices of the Conv2d layers used in the model, and are not pre-loaded images like the "butt curve" you gave in the example. If this were the case, we would need to provide the CNN with all possible types of shapes, curves, colours, and hope that any unseen data we feed the model contains this finite sets of images somewhere in them which the model can recognise.
Instead, we allow the CNN to learn the filters it requires to sucessfully classify from the data itself, and hope it can generalise to new data. Through multitudes of iterations and data( which they require a lot of), the model iteratively crafts the best set of filters for it to succesfully classify the images. The random initialisation at the start of training ensures that all filters per layer learn to identify a different feature in the input image.
The fact that earlier layers usually corresponds to colour and edges (like above) is not predefined, but the network has realised that looking for edges in the input is the only way to create context in the rest of the image, and thereby classify (humans do the same initially).
The network uses these primitive filters in earlier layers to generate more complex interpretations in deeper layers. This is the power of distributed learning: representing complex functions through multiple applications of much simpler functions.

How to sweep a neural-network through an image with tensorflow?

My question is about finding an efficient (mostly in term of parameters count) way to implement a sliding window in tensorflow (1.4) in order to apply a neural network through the image and produce a 2-d map with each pixel (or region) representing the network output for the corresponding receptive field (which in this case is the sliding window itself).
In practice, I'm trying to implement either a MTANN or a PatchGAN using tensorflow, but I cannot understand the implementation I found.
The two architectures can be briefly described as:
MTANN: A linear neural network with input size of [1,N,N,1] and output size [ ] is applied to an image of size [1,M,M,1] to produce a map of size [1,G,G,1], in which every pixel of the generated map corresponds to a likelihood of the corresponding NxN patch to belong to a certain class.
PatchGAN Discriminator: More general architecture, as I can understand the network that is strided through the image outputs a map itself instead of a single value, which then is combined with adjacent maps to produce the final map.
While I cannot find any tensorflow implementation of MTANN, I found the PatchGAN implementation, which is considered as a convolutional network, but I couldn't figure out how to implement this in practice.
Let's say I got a pre-trained network of which I got the output tensor. I understand that convolution is the way to go, since a convolutional layer operates over a local region of the input and what is I'm trying to do can be clearly represented as a convolutional network. However, what if I already have the network that generates the sub-maps from a given window of fixed-size?
E.g. I got a tensor
sub_map = network(input_patch)
which returns a [1,2,2,1] maps from a [1,8,8,3] image (corresponding to a 3-layer FCN with input size 8, filter size 3x3).
How can I sweep this network on [1,64,64,3] images, in order to produce a [1,64,64,1] map composed of each spatial contribution, like it happens in a convolution?
I've considered these solutions:
Using tf.image.extract_image_patches which explicitly extract all the image patches and channels in the depth dimension, but I think it would consume too many resources, as I'm switching to PatchGAN Discriminator from a full convolutional network due to memory constraints - also the composition of the final map is not so straight-forward.
Adding a convolutional layer before the network I got, but I cannot figure out what the filter (and its size) should be in this case in order to keep the pretrained model work on 8x8 images while integrating it in a model which works on bigger images.
For what I can get it should be something like whole_map = tf.nn.convolution(input=x64_images, filter=sub_map, ...) but I don't think this would work as the filter is an operator which depends on the receptive field itself.
The ultimate goal is to apply this small network to big images (eg. 1024x1024) in an efficient way, since my current model downscales progressively the images and wouldn't fit in memory due to the huge number of parameters.
Can anyone help me to get a better understanding of what I am missing?
Thank you
I found an interesting video by Andrew Ng exactly on how to implement a sliding window using a convolutional layer.
The problem here was that I was thinking at the number of layers as a variable that is dependent on a fixed input/output shape, while it should be the opposite.
In principle, a saved model should only contain the learned filters for each level and as long as the filter shapes are compatible with the layers' input/output depth. Thus, applying a different (ie. bigger) spatial resolution to the network input produces a different output shape, which can be seen as an application of the neural network to a sliding windows sweeping across the input image.

How to know what Tensorflow actually "see"?

I'm using cnn built by keras(tensorflow) to do visual recognition.
I wonder if there is a way to know what my own tensorflow model "see".
Google had a news showing the cat face in the AI brain.
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to take out the image in my own cnn networks.
For example, what my own cnn model recognize a car?
We have to distinguish between what Tensorflow actually see:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as a result of some kind of reverse deconvolution (which is not a real math deconvolution in fact) process.
To answer to your real question, there is a lot of good example solution out there, one you can study it with success: Visualizing output of convolutional layer in tensorflow.
When you are building a model to perform visual recognition, you actually give it similar kinds of labelled data or pictures in this case to it to recognize so that it can modify its weights according to the training data. If you wish to build a model that can recognize a car, you have to perform training on a large train data containing labelled pictures. This type of recognition is basically a categorical recognition.
You can experiment with the MNIST dataset which provides with a dataset of pictures of digits for image recognition.

deep learning for shape localization and recognition

There is a set of images, each of which contains different shape entities, such as shown in the following figure. I am trying to localize and recognize these different shapes. For instance, adding a bounding box for each different shape and maybe even label it. What are the major research papers/deep learning models that have been able to solve this kind of problem?
Object detection papers such as rcnn, faster rcnn, yolo and ssd would help you solve this if you were bent on using a deep learning approach.
It’s easy to say this is a trivial problem that can be solved with tools in OpenCV and deep learning is overkill, but I can see many reasons to use deep learning tools and that does not answer your question.
We assume that your shapes has different scales and rotations. Actually your main image shown above is very large for training process and it needs a lot of training samples to generate a good accuracy at the end on test samples. In this case it is better to train a Convolutional Neural Network on a short images (like 128x128) with only one shape per each image and then use slide trick!
This project will have three main steps:
Generate test and train samples, each image should have only one shape
Train a classifier to recognize a single shape within each input image
Use slide trick! Break your original image containing many shapes to overlapping blocks of size 128x128. Pass each block to your model trained in the second step.
In this way at the end you will have label for each shape from your trained model, and also you will have location of each shape using slide trick.
For the classifier you can use exactly CNN structure of Tensorflow's MNIST tutorial.
Here is a paper with exactly same method applied to finger print images to extract local features.
A direct fingerprint minutiae extraction approach based on convolutional neural networks