The filter vector and its layer function in a Convolutional Neural Network

For image recognition, there is something about the filter values versus their layer function that I don't get. Many articles mention a similar idea: "... to detect edges from raw pixels at the first layer, then use the edges to detect simple shapes at the second layer ...", and some articles write: "the filters are initialized randomly and automatically learned from the data during training."
My question is: if the filter values are not arranged in some order in a CNN (i.e., they are learned from random initial values), how can we know that a CNN (always?) learns edges at the first layer, shapes at the second layer, and so on? Thank you very much!

If the filters are learned starting from arbitrary values, which I understand they are, how does a CNN always seem to learn an image from edges, then shapes, and so on? It looks like a CNN finds its own way (or pattern?) to arrange the filters in an order. My guess is that the 'filtering-pooling' process resizes the original image, so the CNN learns image features in a hierarchical fashion.


Is it possible to train a NN in Keras with features that won't be available for prediction?

I'm fairly new to this topic as a whole and struggle to wrap my head around even the basics of neural networks in general. I'm not looking for a project plan; I appreciate that you probably have better things to do.
Nonetheless, any idea or push in the right direction is appreciated.
Imagine a grey-box model of some kind (a thermal network, an electrical network, and so on) where it's desirable to predict outputs based on very few features, with an underlying smart model that is trained on a much bigger dataset.
My question is whether it is possible to train a model with a full feature set but designate some features as mandatory and others as merely good-to-have for prediction.
Any tips are appreciated.
Cheers
Yes, you can train your model like that, but you must feed all the features during prediction. For example, say you have 30 mandatory features and 10 optional features, 40 in total. You must feed all 40 features to get a prediction from your model; the input data shape must always be the same. "But the features were declared optional, so why am I being forced now?" Well, I will discuss two options.
Option 1: set the input shape to None. If you set the input shape to None, your model will accept input of any shape, but you will have to handle a few things. You can't freely use a MaxPooling layer: if you really need MaxPool, you will need to calculate the input and output shapes of all the layers using only the mandatory feature shape (the minimum input shape). If you calculate them using (mandatory + optional features), you will end up with an error, because the input to a MaxPooling layer can become too small to be reduced further. Take care of that and you're good to go.
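As a minimal sketch of Option 1, assuming tf.keras and a 1-D feature sequence (the layer sizes, and the use of global pooling instead of MaxPooling, are illustrative choices, not from the question):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# shape=(None, 1): the feature axis is left undefined, so the model can
# accept the 30 mandatory features alone or all 40 features.
inp = layers.Input(shape=(None, 1))
x = layers.Conv1D(32, kernel_size=3, padding="same", activation="relu")(inp)
# Global pooling collapses the variable-length axis to a fixed size,
# sidestepping the MaxPooling shape bookkeeping described above.
x = layers.GlobalMaxPooling1D()(x)
out = layers.Dense(1, activation="sigmoid")(x)
model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```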
Option 2: I will give you an example. I was using the OpenPose output dataset to classify some movements. The OpenPose output is 18 bone keypoints, i.e. 36 features counting the x and y coordinates. These keypoints were extracted from live camera frames, but we can't assume that all the body parts of a person will always be inside the frame. When someone's legs are outside the frame, we can't get their leg keypoints, but we still need to classify. There were a lot of options: we could replace the missing keypoints with 0, or find the median/mean of similar poses and use that value for the keypoints. We found the best choice by analyzing all the data. If you go with option 2, I suggest you analyze the data first and then decide how to handle the missing fields.
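As a minimal sketch of Option 2's imputation idea (the data, indices, and median strategy are illustrative, not from the OpenPose project):

```python
import numpy as np

# Illustrative training data: 40 features (30 mandatory + 10 optional).
X_train = np.random.rand(1000, 40)
medians = np.median(X_train, axis=0)  # fallback values, computed once

def fill_missing(sample, missing_idx):
    """Replace absent optional features with training-set medians."""
    filled = np.array(sample, dtype=float)
    filled[missing_idx] = medians[missing_idx]
    return filled

# A sample whose 10 optional features (indices 30-39) were not observed:
sample = np.zeros(40)
sample[:30] = np.random.rand(30)             # mandatory features observed
x = fill_missing(sample, np.arange(30, 40))  # impute the optional ones
# model.predict(x[None, :])                  # shape (1, 40), as required
```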

visualizing 2nd convolutional layer

Via tf.summary I visualized the first convolutional layer in my model, of shape [5,5,3,32], as a set of individual images, one per filter. So this layer has filters of 5x5 spatial dimensions and depth 3, and there are 32 of them. I'm viewing these filters as 5x5 color (RGB) images.
I'm wondering how to generalize this to the second convolutional layer, the third, and so on.
The shape of the second convolutional layer is [5,5,32,64].
My question is: how would I transform that tensor into individual 5x5x3 images?
With the first conv layer of shape [5,5,3,32], I visualize it by first transposing with tf.transpose(W_conv1, (3,0,1,2)), which gives 32 images of shape 5x5x3.
Doing tf.transpose(W_conv2, (3,0,1,2)) would produce a shape of [64,5,5,32]. How would I then use those "32 color channels"? (I know it's not that simple :) ).
Visualization of higher-level filters is usually done indirectly. To visualize a particular filter, you look for images that the filter would respond to the most. For that you perform gradient ascent in the space of images: instead of changing the parameters of the network as you do when training, you change the input image itself.
You will understand it more easily if you play with the following Keras code: https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py
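For a flavor of what that code does, here is a minimal sketch of the gradient-ascent loop, assuming tf.keras; the layer name, input size, step count, and learning rate are all illustrative:

```python
import tensorflow as tf

def visualize_filter(model, layer_name, filter_index, steps=30, lr=10.0):
    # Build a sub-model that exposes the activations of the target layer.
    layer = model.get_layer(layer_name)
    extractor = tf.keras.Model(model.input, layer.output)
    # Start from a near-gray random image and ascend its gradient.
    img = tf.Variable(tf.random.uniform((1, 64, 64, 3)) * 0.1 + 0.45)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activations = extractor(img)
            # Maximize the mean activation of the chosen filter.
            loss = tf.reduce_mean(activations[..., filter_index])
        grad = tape.gradient(loss, img)
        grad = grad / (tf.norm(grad) + 1e-8)  # normalize for stable steps
        img.assign_add(lr * grad)
    return img[0].numpy()  # an image the filter responds strongly to
```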

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates, but I don't want only the most probable future path: I want all the most probable paths, to visualize them in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step towards predicting all the probable paths? To get all probable paths, I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick, I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you lose generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X of shape (10, 2) will definitely be able to predict a sequence Y of shape (6, 2). But since you want not just one but all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that: it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph on Stack Overflow, and even if they could, I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying, but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic programming libraries for Python, better suited to the task at hand than Keras.
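To make the suggestion a bit more concrete, a minimal sketch of an LSTM-VAE in tf.keras might look like the following; the latent size, layer widths, and KL weight are illustrative hyperparameters, not a recipe:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 16

# Encoder: read the 10-step input sequence into a latent distribution.
enc_in = layers.Input(shape=(10, 2))
h = layers.LSTM(64)(enc_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps  # reparameterization trick

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder: unroll the latent code into the 6-step output sequence.
d = layers.RepeatVector(6)(z)
d = layers.LSTM(64, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(2))(d)

vae = Model(enc_in, out)
# Reconstruction (MSE) plus KL divergence; 1e-3 is a tunable weight.
kl = -0.5 * tf.reduce_mean(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
vae.add_loss(1e-3 * kl)
vae.compile(optimizer="adam", loss="mse")

# vae.fit(X, Y, ...)   # X: (40000, 10, 2), Y: (40000, 6, 2)
# At inference, call predict repeatedly: each call samples a new z, so the
# set of outputs empirically approximates the distribution of future paths.
```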

variable size of input for CNN model in text classification?

I implemented the CNN model for text classification based on this paper. Since a CNN can only deal with sentences of a fixed size, I set the input size to the maximum sentence length in my dataset and zero-pad the shorter sentences. But to my understanding, no matter how long the input sentence is, the max-pooling strategy will always extract only one value per filter map. So the size of the input sentence shouldn't matter, because after convolution and pooling the output will have the same size either way. In that case, why should I zero-pad all the short sentences to a fixed size?
For example, my code for feeding data into the CNN model is self.input_data = tf.placeholder(tf.int32,[None,max_len],name="input_data"); can I avoid specifying max_len and use None instead, based on the length of the current training sentence?
In addition, I was wondering whether there is any other approach that can handle variable-size input for a CNN model. I also found another paper that addresses this problem, but to my understanding it uses k values for max-pooling instead of a single max value, which lets it deal with variable-length sentences. How?
Quick answer:
No, you can't.
Longer answer:
Pooling is like a reduce function: applying it to a layer reduces the dimensions, but different input shapes do not produce the same output shapes. With zero padding you can simulate fixed shapes, which is exactly what max_len is doing. The second paper's idea is different: it uses a dynamic computational graph, which is not the same thing. It essentially creates several networks with different depths, depending on their input size. The generalized version of this for encoder-decoder architectures is called ByteNet.
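For the second paper's k-max pooling, a rough sketch in TensorFlow could look like this; note that tf.nn.top_k returns the k largest values sorted by magnitude, whereas the paper's k-max pooling keeps them in their original sequence order, so this is only an approximation:

```python
import tensorflow as tf

def k_max_pooling(x, k):
    """Keep the k largest activations per filter over the time axis.

    x: (batch, time, channels) -> (batch, k, channels); the output size
    depends only on k, so variable-length sentences become fixed-size.
    """
    x_t = tf.transpose(x, [0, 2, 1])        # (batch, channels, time)
    values, _ = tf.nn.top_k(x_t, k=k)       # k largest per channel
    return tf.transpose(values, [0, 2, 1])  # back to (batch, k, channels)
```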

tensorflow: batches of variable-sized images

When one passes data to tf.train.batch, it looks like the shape of each element has to be fully defined, or else it complains that All shapes must be fully defined if there exist Tensors with shape Dimension(None). How, then, does one train on images of different sizes?
You could set dynamic_pad=True in the arguments of tf.train.batch.
dynamic_pad: Boolean. Allow variable dimensions in input shapes. The given dimensions are padded upon dequeue so that tensors within a batch have the same shapes.
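A minimal sketch, assuming the TF1-era queue-based input pipeline (the file names are placeholders):

```python
import tensorflow as tf

# Toy pipeline: a queue yielding PNG images of varying height and width.
filename_queue = tf.train.string_input_producer(["a.png", "b.png"])
reader = tf.WholeFileReader()
_, contents = reader.read(filename_queue)
image = tf.image.decode_png(contents, channels=3)  # static shape (?, ?, 3)

# dynamic_pad=True zero-pads each tensor up to the largest shape in the
# batch, so fully defined static shapes are no longer required.
images = tf.train.batch([image], batch_size=4, dynamic_pad=True)
```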
Usually, images are resized to a fixed number of pixels.
Depending on your task, you might be able to use other techniques to process images of varying sizes. For example, in face recognition and OCR, a fixed-size window is used and moved over the image. For other tasks, convolutional neural networks with pooling layers, or recurrent neural networks, can be helpful.
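The resizing workaround is a one-liner, e.g. with TF1's image ops (the 224x224 target is an arbitrary choice):

```python
import tensorflow as tf

image = tf.placeholder(tf.float32, shape=[None, None, 3])  # any-size input
resized = tf.image.resize_images(image, [224, 224])        # fixed-size output
```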
I see that this is quite an old question, but in case someone is searching for how variable-size images can still be used in batches: I can describe what I did for an image-to-image convolutional network (inference) that was trained for variable image sizes with batch size 1. Why: when I tried to process images in batches using padding, the results became much worse, because the signal was "spreading" inside the network and started to influence its convolution pyramids.
What I did is possible when you have the source code and can load weights manually into the convolutional layers. I modified the network in the following way: along with a batch of zero-padded images, I added an additional placeholder that received a batch of binary masks, with 1 where actual data was on the patch and 0 where padding was applied. Then I multiplied the signal by these masks after each convolutional layer inside the network, fighting the "spreading". Multiplication isn't an expensive operation, so it did not affect performance much.
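A minimal sketch of that mask multiplication, with illustrative names (a real network would repeat the multiplication after every convolutional layer):

```python
import tensorflow as tf

# x:    (batch, H, W, 3) zero-padded images / feature maps
# mask: (batch, H, W, 1) with 1 over real data, 0 over padding
x = tf.placeholder(tf.float32, [None, None, None, 3])
mask = tf.placeholder(tf.float32, [None, None, None, 1])

w = tf.get_variable("w", shape=[3, 3, 3, 16])
conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
# Zero out activations that leaked in from the padded region,
# preventing the "spreading" described above.
conv = conv * mask
```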
The result was no longer deformed, but it still had some border artifacts, so I refined this approach further by adding a small (2 px) symmetric padding around the input images (the kernel size of all the layers in my CNN was 3) and keeping it during propagation by using a slightly bigger (+[2px, 2px]) mask.
One can apply the same approach for training as well; then some sort of "masked" loss is needed, where only the ROI on each patch is used to compute the loss. For example, for an L1/L2 loss you can compute the difference image between the generated and label images and apply the masks before summing up. More complicated losses might involve unstacking or iterating over the batch and extracting the ROI using tf.where or tf.boolean_mask.
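For instance, a masked L1 loss might look like this sketch (tensor names are illustrative):

```python
import tensorflow as tf

def masked_l1_loss(generated, labels, mask):
    """L1 loss restricted to the valid region of each padded image.

    generated, labels: (batch, H, W, C); mask: (batch, H, W, 1),
    1 on real pixels and 0 on padding.
    """
    diff = tf.abs(generated - labels) * mask  # broadcast over channels
    # Normalize by the number of valid pixels, not the padded area.
    return tf.reduce_sum(diff) / tf.maximum(tf.reduce_sum(mask), 1.0)
```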
Such training can indeed be beneficial in some cases, because you can combine small and big inputs for the network without the small inputs being affected by the loss from big padded surroundings.