I’m trying to understand the NASNet-A architecture in detail, but can’t match the parameter counts in the paper.
For example, the paper says the CIFAR-10 NASNet-A “6 @ 768” model has 3.3M params, but by my calculations a single “sep 5x5” primitive in the final cell should alone have 2.9M params… which can’t be right!
Here’s how I derive this count…
The “6 @ 768” notation means the “number of filters in the penultimate layer of the network” is 768, which I assume means the number of filters in each of the primitive operations in the cell is 768, and therefore the output depth of the concat operation (with 5 block inputs) is 5 * 768. Since shape is only changed by reduction cells, the input to the final cell (concat output from prior normal cell) will also be of depth 5 * 768.
So for a 5x5 separable convolution with 5 * 768 input channels and 768 output channels, the number of parameters is:
5x5x1 * (5 * 768) = 96,000 params for the 5x5 depthwise filters, plus
1x1x(5 * 768) x 768 = 2,949,120 params for the 1x1 pointwise filters
Where am I going wrong?!
The number of output channels from each operation in a cell's block is given by the defined num_conv_filters. For CIFAR NASNet-A this is 32, and it doubles after each reduction cell.
Although they mention they have B=5 blocks and no residual connection, there are in fact 6 concatenated chunks of filters; the last one seems to come from the previous layer.
See: https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_utils.py#L309
This is why, for example, you have a feature depth of 192 in the first cell:
6*32 = 192.
You can take a look at the expected depths here:
https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_test.py#L127
So, for example, for the last 5x5 separable convolution (768 input channels, 128 output filters, since num_conv_filters has doubled twice from 32 to 128 after the two reduction cells) you get:
5x5x768 (depthwise) + 768x128 (pointwise) = 117,504 parameters
For more info about the separable convolution:
http://forums.fast.ai/t/how-depthwise-separable-convolutions-work/4249
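If you want to sanity-check that 117,504 figure yourself, here is a small sketch using Keras (an assumption on my part; the 768 input channels and 128 output filters are the values discussed above, and use_bias=False so only the convolution weights are counted):

from keras.models import Model
from keras.layers import Input, SeparableConv2D

# 768 input channels, 128 output filters, 5x5 kernel, no bias
inp = Input(shape=(8, 8, 768))                # the spatial size does not affect the parameter count
out = SeparableConv2D(128, (5, 5), padding='same', use_bias=False)(inp)
Model(inp, out).summary()
# depthwise: 5*5*768 = 19,200  +  pointwise: 1*1*768*128 = 98,304  =  117,504 params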
I am dying to understand one question that I cannot find any answer to:
When doing Conv1D on a multivariate time series, is the KERNEL convolved across ALL dimensions, or over each dimension individually? Is the size of the kernel [kernel_size x 1] or [kernel_size x num_dims]?
The thing is that I input an 800 by 10 time series into a Conv1D(filters=16, kernel_size=6)
and I get 800 by 16 as output, whereas I would expect to get 800 by 16 by 10, because each time series dimension would be convolved with the filter individually.
What is the case?
Edit: Toy example for discussion:
We have 3 input channels, 800 time steps long. We have a kernel 6 time steps wide, meaning the effective kernel dimensions are [3, 1, 6].
At each time step, 6 timesteps in each channel are convolved with the kernel. Then all of the kernel's elements are summed.
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
When you convolve an "image" with multiple channels you sum across all the channels and then stack up filters you use to get a new "image" with (# of filters) channels. The thing that's a bit difficult for some people to understand is that the filter itself is actually (kernel_size x 1 x Number of channels). In other words your filters have depth.
So given that you're inputting this as an 800 x 1 "image" with 10 channels, you will end up with an 800 x 1 x 16 image, since you stack 16 filters. Of course the 1s aren't really important for conv1d and can be ignored, so tl;dr 800 x 10 -> 800 x 16 in this case.
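A quick shape check in Keras (padding='same' is an assumption here to reproduce the 800-step output you report; the default padding='valid' would give 795 steps):

from keras.models import Sequential
from keras.layers import Conv1D

model = Sequential()
model.add(Conv1D(filters=16, kernel_size=6, padding='same', input_shape=(800, 10)))
print(model.output_shape)   # (None, 800, 16): the 10 channels are summed into each filter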
Response to part 2:
We have 3 input channels, 800 time steps long. We have a kernel 6 time steps wide, meaning the effective kernel dimensions are [3, 1, 6].
This is essentially correct.
At each time step, 6 timesteps in each channel are convolved with the kernel. Then all of the kernel's elements are summed.
Yes, this is essentially correct. We end up with a slightly smaller image, as we repeat this operation each time we slide the kernel along the time axis, giving us a 795 x 1 x 1 new image (800 - 6 + 1 = 795, with stride 1 and no padding). We then repeat this operation (# of filters) times and stack the results along the third dimension, so we end up with 795 x 1 x (# of filters).
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
For something to require Conv2D, it needs to have a 2nd dimension value greater than 1. For example, a color photograph might be 224 x 224 and have 3 color channels, so it'd be 224 x 224 x 3.
Notably when we perform Conv2D, we also are sliding our kernel in an additional direction, for example, up and down. This is not required when you simply add more channels, since they're just added to the sum for that cell. Since we're only sliding on one axis in your example (time), we only need Conv1D.
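And for the 3-channel toy example with no padding, the same kind of check reproduces the 795-step result described above (a sketch; the number of filters is arbitrary):

from keras.models import Sequential
from keras.layers import Conv1D

toy = Sequential()
toy.add(Conv1D(filters=4, kernel_size=6, input_shape=(800, 3)))   # padding='valid' by default
print(toy.output_shape)   # (None, 795, 4): 800 - 6 + 1 = 795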
I want to train a Keras DNN for video frame prediction:
Input: 4 consecutive Frames of a Video
Output: Next frame, predicted from the network
So, basically, the dimensions are: input: (number_samples, 4, 60, 60), output: (number_samples, 1, 60, 60). I need some help in getting from the 4 frames in the input down to 1 frame in the output.
I've found an example here and would like to work with that.
Problem is, in that network the output is not one frame, but the same number of frames as the input (so my task is actually simpler, because I want to generate only the one next frame, not 4). Now I don't know which layers I could append at the end of the network, or how I could change the network, so that the output dimensions are as desired (one frame instead of 4).
Appending a Conv2D layer at the end did not work, because its dimensions do not match those of the Conv3D output.
Any ideas on how to go about that problem and what my network architecture could look like? Any advice on my task in general and how I could build a good network for it is also appreciated.
This loop in the code example (for which you gave the URL) can be tailored to do what you desire.
for j in range(16):
    # predict a full sequence from the current track of frames
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    # keep only the last predicted frame...
    new = new_pos[::, -1, ::, ::, ::]
    # ...and append it to the track so the next iteration builds on it
    track = np.concatenate((track, new), axis=0)
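If you only want a single next frame rather than extending the sequence, a minimal sketch (assuming seq is the trained model from the linked example and track holds your 4 input frames with shape (4, 60, 60, 1)):

pred = seq.predict(track[np.newaxis, ::, ::, ::, ::])   # (1, 4, 60, 60, 1): one output frame per input frame
next_frame = pred[::, -1, ::, ::, ::]                    # keep only the last one -> your (1, 60, 60, 1) prediction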
I implemented a fully convolutional network in TensorFlow. It uses an encoder-decoder structure.
When training, I always use the same image size (224x224, using random crop) and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). For example, such an image might have size [406, 256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). When training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is caused by some rounding done in the Conv and Pool layers.
What should I change in my architecture so that it works for any image size I want? Should I change the rounding parameters? Or add 'cropping' of a tensor?
Link to implementation of architecture:
https://gist.github.com/melgor/0e43cadf742fe3336148ab64dd63138f
(the problem occurs at line 166)
I found the solution for variable input size :)
What we really needed was a 'crop layer' that crops one tensor to match the other. I found a really similar layer here: http://tf-unet.readthedocs.io/en/latest/_modules/tf_unet/layers.html
(crop_and_concat).
I have just turned it into `crop_and_add` and it is working:
import tensorflow as tf

def crop_and_add(x1, x2):
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top-left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    size = [-1, x2_shape[1], x2_shape[2], -1]
    # crop x1 to the spatial size of x2, then add the two tensors
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
I replaced every addition in the model with the layer above (i.e., wherever encoder and decoder data are merged).
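For instance, where the model previously computed z = x + y (x and y here standing for the two tensors being merged):

z = crop_and_add(x, y)   # instead of z = x + y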
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image and that the image has 3 channels, but we know neither the width nor the height. And it works very nicely! (40 FPS on a K80 on AWS P2; the image size is 224x{} - the shorter side of the image is 224.)
FYI, I also tried to run ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLU (which is slow in TF). Also it does not support arbitrary image sizes because of the UnPool layer, which needs its output size predefined as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.
So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows, each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k - sequence_length) different arrays and feed all of those to the network?
Assuming that I want the first layer to be equivalent to the number of features per row, and that I'm looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50, 27)? And again, do I have to manually split the dataset up, or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the number of steps in each sequence.
Elements per step is how much information you have in each step of a sequence.
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence, also with 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long; if it creates memory problems for you, it may be a good idea to use a stateful=True RNN, but I'm explaining the non-stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
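A minimal sketch of such a stack (the 64-cell first layer is just illustrative; also note that Y_train above has shape (1, 550000), so it needs a trailing axis to match the output):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(64, input_shape=(550000, 27), return_sequences=True))
model.add(LSTM(1, return_sequences=True))     # last layer has 1 cell -> output shape (BatchSize, 550000, 1)
model.compile(loss='mse', optimizer='adam')
# model.fit(X_train, Y_train.reshape(1, 550000, 1), epochs=..., batch_size=1)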
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train into shorter sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
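If manual splitting turns out to be needed, a rough, untested sketch might look like this (the chunk length of 10000 is arbitrary, and X_train/Y_train are the arrays shaped above):

from keras.models import Sequential
from keras.layers import LSTM

chunks = 55                                   # 550000 = 55 chunks of 10000 steps
model = Sequential()
model.add(LSTM(64, batch_input_shape=(1, 10000, 27), return_sequences=True, stateful=True))
model.add(LSTM(1, return_sequences=True, stateful=True))
model.compile(loss='mse', optimizer='adam')

for epoch in range(10):
    for c in range(chunks):                   # feed the chunks in their original order
        x_chunk = X_train[:, c*10000:(c+1)*10000, :]
        y_chunk = Y_train[:, c*10000:(c+1)*10000].reshape(1, 10000, 1)
        model.train_on_batch(x_chunk, y_chunk)
    model.reset_states()                      # forget the carried state before re-reading the sequence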
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000 - 49):
    X.append(originalX[i:i+50, :27])   # a 50-step window of the 27 features
X = np.array(X)                        # shape: (549951, 50, 27)

# targets: it seems you just exclude the first 49 steps from the original
Y = originalX[49:, 27]
In tensorflow.contrib.seq2seq's AttentionWrapper, what does "depth" refer to as stated in the attention_layer_size documentation? When the documentation says to "use the context as attention" if the value is None, what is meant by "the context"?
In Neural Machine Translation by Jointly Learning to Align and Translate they give a description of the (Bahdanau) attention mechanism; essentially what happens is that you compute scalar "alignment scores" a_1, a_2, ..., a_n that indicate how important each element of your encoded input sequence is at a given moment in time (i.e. which part of the input sentence you should pay attention to right now in the current timestep).
Assuming your (encoded) input sequence that you want to "pay attention"/"attend over" is a sequence of vectors denoted as e_1, e_2, ..., e_n, the context vector at a given timestep is the weighted sum over all of these as determined by your alignment scores:
context = c := (a_1*e_1) + (a_2*e_2) + ... + (a_n*e_n)
(Remember that the a_k's are scalars; you can think of this as an "averaged-out" letter/word in your sentence --- so ideally if your model is trained well, the context looks most similar to the e_i you want to pay attention to the most, but bears a little bit of resemblance to e_{i-1}, e_{i+1}, etc. Intuitively, think of a "smeared-out" input element, if that makes any sense...)
Anyway, if attention_layer_size is not None, then it specifies the number of hidden units in a feedforward layer within your decoder that is used to mix this context vector with the output of the decoder's internal RNN cell to get the attention value. If attention_layer_size == None, it just uses the context vector above as the attention value, and no mixing of the internal RNN cell's output is done. (When I say "mixing", I mean that the context vector and the RNN cell's output are concatenated and then projected to the dimensionality that you specify by setting attention_layer_size.)
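A tiny numpy sketch of the two cases (all sizes are made up, and the projection is shown as a plain matrix multiply, whereas the real wrapper uses a trainable feedforward layer):

import numpy as np

e = np.random.randn(4, 8)               # n=4 encoded inputs e_1..e_4, each of size 8
a = np.array([0.1, 0.7, 0.1, 0.1])      # alignment scores (already softmax-normalized)
context = (a[:, None] * e).sum(axis=0)  # weighted sum over the e_k's, shape (8,)

cell_output = np.random.randn(8)        # the decoder RNN cell's output at this timestep

# attention_layer_size is None: the context itself is used as the attention value
attention = context

# attention_layer_size == 16: concatenate cell output and context, then project to 16 dims
W = np.random.randn(16, 16)             # stands in for the feedforward attention layer
attention = W @ np.concatenate([cell_output, context])   # shape (16,)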
The relevant part of the implementation is at this line and has a description of how it's computed.
Hope that helps!