Could anyone check my reasoning?
Let's say I have a (pre-trained) fully connected layer fc that takes bx20x20x10 as input and produces bx64 as output, where b is the batch size.
Now, I have an input of cx100x60x10. The height and width 100x60 can be subdivided into 5x3 blocks of 20x20. I would like to have a 5x3 grid of local responses (outputs) from the fc layer, i.e., cx5x3x64.
Now I am thinking: this is the same as having a convolution layer with the fc weights and a stride of 20 in both width and height. Is that correct? Could there be any difference?

Yes, it will be the same if appropriate reshaping of the dense layer weight matrix is performed.
Let us first look at the dense layer. You input a 20 x 20 x 10 tensor to the dense layer. It will first be flattened to produce a 4000 x 1 vector. You want the output to be a 64 x 1 vector. So, the layer needs a 4000 x 64 weight matrix and 64 bias parameters. Then y = w^T * x + b = [4000 x 64]^T * [4000 x 1] + [64 x 1] will yield a [64 x 1] vector. Therefore, y[i] = w[0][i]*x[0] + ... + w[3999][i]*x[3999] + b[i] for i in [0, 63]. Note that b here denotes the bias vector, not the batch size.
Let us turn to convolution. To produce a 5 x 3 x 64 output from an input of size 100 x 60 x 10, you need 64 filters, each of size (20,20) and strides (20,20) with no zero-padding. Each 20 x 20 filter however has local connectivity extending along the entire depth i.e. a neuron is connected to all the 10 dimensions along the depth of input. Please read this for more information on local connectivity of convolutional layer.
Your convolutional layer has a receptive field of 20 x 20. Each neuron in the convolutional layer will be connected to a 20 x 20 x 10 block, thus 4000 weights in total (and one bias parameter). You have 64 such filters. Therefore, the total number of learnable parameters for this layer = 4000 x 64 + 64. Convolution between one block of x (stored channels-first, 10 x 20 x 20) and w (size = 64 x 10 x 20 x 20) can be performed as:
convResult = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1], axis=-1), axis=-1),axis=-1)
There are some fine points here. I did w[:,:,::-1,::-1] because theano convolution flips the convolution kernel (well, not that simple!). If you are interested in who flips and who does not, read this.
Finally, dense layer and convolution layer (in this context) essentially do the same operation. They first element-wise multiply and then sum up two sets of vectors/matrices of 4000 elements. This procedure is repeated 64 times to produce a 64 x 1 vector. So, it is possible to achieve exactly the same result with dense and convolution layer by proper reshaping of the dense layer weight matrix. However, you need to take care of kernel flipping to match the results.
Below I give a code snippet to compute convolution manually (using numpy) and using Theano.
import theano
from theano import tensor as T
import numpy as np
X = T.ftensor4('X')
W = T.ftensor4('W')
out = T.nnet.conv2d(X,W)
f = theano.function([X, W], out, allow_input_downcast=True)
x = np.random.random((1,10,20,20))
w = np.random.random((64,10,20,20))
# convolution using Theano
c1 = np.squeeze(f(x,w)[0])
# convolution using Numpy
c2 = np.sum(np.sum(np.sum(x*w[:,:,::-1,::-1],axis=-1),axis=-1),axis=-1)
# check that both are almost identical
print(np.amax(np.abs(c2 - c1)))
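The equivalence with the dense layer can also be checked without Theano. The following is a minimal NumPy sketch, assuming a channels-last layout and cross-correlation (no kernel flip); the names w_fc and b_fc and the random data are made up for illustration and do not come from any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
c, H, W, C, n_out = 2, 100, 60, 10, 64
x = rng.standard_normal((c, H, W, C))
w_fc = rng.standard_normal((20 * 20 * C, n_out))  # hypothetical dense weights
b_fc = rng.standard_normal(n_out)                 # hypothetical dense bias

# Dense layer applied independently to each 20x20x10 block.
blocks = x.reshape(c, 5, 20, 3, 20, C).transpose(0, 1, 3, 2, 4, 5)  # (c, 5, 3, 20, 20, 10)
y_dense = blocks.reshape(c, 5, 3, -1) @ w_fc + b_fc                 # (c, 5, 3, 64)

# The same weights viewed as 64 filters of size 20x20x10, applied with stride 20.
kernel = w_fc.reshape(20, 20, C, n_out)
y_conv = np.empty((c, 5, 3, n_out))
for i in range(5):
    for j in range(3):
        patch = x[:, 20 * i:20 * (i + 1), 20 * j:20 * (j + 1), :]
        y_conv[:, i, j, :] = np.tensordot(patch, kernel, axes=([1, 2, 3], [0, 1, 2])) + b_fc

print(np.allclose(y_dense, y_conv))
```

The reshape of w_fc only works because the dense layer flattens each block in the same (height, width, channel) order as the kernel; with a different flattening order or a kernel-flipping convolution, the weights would have to be permuted or flipped first.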


Optimizing Tensorflow for many small matrix-vector multiplications

To build a capsule network training script, I need to compute many small matrix-vector multiplications.
The size of each weight matrix is at most 20 by 20.
The number of weight matrices is more than 900.
I'm curious whether tf.matmul or tf.linalg.matvec is the best option for this.
Could anybody give me a hint on how to optimize the training script?
Looking at the notebook that you are referring to, it seems you have the following parameters:
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
And then you have a tensor w with shape (caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims) (in the notebook it has an initial dimension with size 1 that I am skipping) and another tensor caps1_output with shape (batch_size, caps1_n_caps, caps1_n_dims). And you need to combine them to produce caps2_predicted with shape (batch_size, caps1_n_caps, caps2_n_caps, caps2_n_dims).
In the notebook they tile the tensors in order to operate on them with tf.linalg.matmul, but actually you can compute the same result without any tiling just using tf.einsum:
import tensorflow as tf
batch_size = 50
caps1_n_caps = 1152
caps1_n_dims = 8
caps2_n_caps = 10
caps2_n_dims = 16
w = tf.zeros((caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims), dtype=tf.float32)
caps1_output = tf.zeros((batch_size, caps1_n_caps, caps1_n_dims), dtype=tf.float32)
# Contract over l (caps1_n_dims): one matrix-vector product per (b, i, j)
caps2_predicted = tf.einsum('ijkl,bil->bijk', w, caps1_output)
# (50, 1152, 10, 16)
I'm not sure if I have understood exactly what you want, but you say you want to compute something like:
ûij = Wij × ui
For a collection of several matrices W and vectors u. Assuming you have 900 matrices and vectors, matrices have size 20×20 and vectors have size 20, you can represent them as two tensors, ws, with shape (900, 20, 20), and us, with shape (900, 20). If you do that, your result us_hat, with shape (900, 20), would be computed simply as:
us_hat = tf.linalg.matvec(ws, us)
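To sanity-check an einsum pattern for this kind of batched product, it can be compared against explicit matrix-vector products. The sketch below uses NumPy as a stand-in for TensorFlow (np.einsum uses the same subscript notation as tf.einsum), with small made-up sizes in place of 50, 1152, 8, 10 and 16:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n1, d1, n2, d2 = 3, 6, 8, 4, 16  # small stand-ins for the notebook sizes
w = rng.standard_normal((n1, n2, d2, d1))
u = rng.standard_normal((batch, n1, d1))

# One contraction over the input-capsule dimension l, no tiling needed.
pred_einsum = np.einsum('ijkl,bil->bijk', w, u)

# Reference: one explicit matrix-vector product per (b, i, j).
pred_loop = np.empty((batch, n1, n2, d2))
for b in range(batch):
    for i in range(n1):
        for j in range(n2):
            pred_loop[b, i, j] = w[i, j] @ u[b, i]

print(pred_einsum.shape, np.allclose(pred_einsum, pred_loop))

# The simpler case: 900 independent 20x20 matrices times 900 vectors.
ws = rng.standard_normal((900, 20, 20))
us = rng.standard_normal((900, 20))
us_hat = np.einsum('nij,nj->ni', ws, us)  # shape (900, 20)
```

Whether einsum, matmul or matvec is fastest depends on the TensorFlow version and hardware, so it is worth timing the candidates on the real shapes.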

Setting filter weights of a convolutional layer

I'm working on a semantic segmentation project which involves dynamic filters in order to learn multiscale representations.
To create these filters I use a Unet backbone and extract the feature maps from the bottleneck layer.
The feature maps are of size H x W x 512, where H is the height of the feature map, W the width and 512 is the number of channels (maps).
These features are passed to a 1x1 convolution to reduce the number of channels, giving H x W x 128, and the features are also passed to an adaptive pooling layer to reduce H x W x 512 to k x k x 512, where k is the size of the filter (e.g. 5).
The filter is then also fed through a 1 x 1 convolution to reduce it to 128.
This gives me a feature map f = H x W x 128 and a filter kernel g of size k x k x 128.
Now I want to convolve f with g and tried the following in keras:
conv = Conv2D(128, kernel_size = 5, kernel_initializer = g, trainable = False)(f)
Unfortunately this does not work and I just get an error saying:
"Could not interpret initializer identifier: Tensor("strided_slice:0", shape = (5,5,128), dtype = float32)"
Now I am wondering what I am doing wrong.
In addition I have to mention that the shape of the output tensor after average pooling / 1x1 conv is (?, 5, 5, 128), where ? is the batch size.
To get the kernel I tried something like:
g = g[0,:,:,:]
Thanks for any advice,
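The kernel_initializer argument of the constructor of Conv2D does not expect a kernel, but a function that would initialize a kernel. You can read more in the documentation.
If you just want to perform a convolution without trainable weights, you are better off using the native TensorFlow function tf.nn.conv2d. Note that it expects a 4-D filter of shape (height, width, in_channels, out_channels), so a kernel g of shape (5, 5, 128) first needs an extra trailing axis, for example with tf.expand_dims(g, axis=-1):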
conv = tf.nn.conv2d(f,g,strides=[1,1,1,1],padding='VALID')
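To make the shapes concrete, here is a minimal NumPy sketch of what a stride-1, 'VALID' convolution computes for a single feature map f and a single dynamic kernel g. The function conv2d_valid and the small sizes are made up for illustration, and the loop is a reference for the arithmetic, not a replacement for tf.nn.conv2d:

```python
import numpy as np

def conv2d_valid(f, g):
    """Cross-correlate f (H, W, C) with g (k, k, C, C_out), stride 1, no padding."""
    H, W, C = f.shape
    k, _, _, C_out = g.shape
    out = np.empty((H - k + 1, W - k + 1, C_out))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Multiply the k x k x C window with every filter and sum.
            out[i, j] = np.tensordot(f[i:i + k, j:j + k, :], g, axes=3)
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((12, 9, 4))       # stands in for H x W x 128
g = rng.standard_normal((5, 5, 4))        # stands in for k x k x 128
out = conv2d_valid(f, g[..., np.newaxis])  # add the out_channels axis
print(out.shape)
```

With a single k x k x C kernel the result has one output channel; if a different kernel is needed per batch element, the convolution has to be applied per sample.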

Is this Neural Net example I'm looking at a mistake or am I not understanding backprop?

Is this model using one relu in two places, or are gradients computed by doing a matrix multiplication of layers on both sides of one layer?
In the last layer of this simple neural net (below), back-propagation calculates the gradient for the last weight matrix w2 by doing a matrix multiplication of (y prediction - y) and h_relu. I thought h_relu only sits between layers w1 and w2, not between w2 and y_pred.
The line in question is near the bottom. It is grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forward and go in order backwards. Is this relu being used in two places?
Here is an attempt at a visual illustration of the model.
This example is from the Pytorch website. It is the second block of code on the page.
grad_w2 = h_relu.t().mm(grad_y_pred)
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
I appreciate your patience looking at this and trying to clear this up for me.
If you could try adding another layer of weights in the middle with another relu, that might help me understand. This is what I was trying to do.
Consider the following diagram which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule on a complex sequence of operations to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram and the loss is represented by the rectangle with the L label.
Using the backward diagram we can follow the path from L to w1 and w2 in order to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars so as to avoid getting into the complexities of multiplying vectors and matrices.
Using this approach the gradients of L w.r.t. w1 and w2 are
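Written out under the scalar assumption, with symbols named after the variables in the code (the notation here is an assumption, with h = w1 x, h_relu = max(h, 0), y_pred = w2 h_relu and L = (y_pred - y)^2), the chain rule along the two paths reads:

```latex
\frac{\partial L}{\partial w_2}
  = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial w_2}
  = 2\,(y_{pred} - y)\; h_{relu}

\frac{\partial L}{\partial w_1}
  = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial h_{relu}}
    \cdot \frac{\partial h_{relu}}{\partial h} \cdot \frac{\partial h}{\partial w_1}
  = 2\,(y_{pred} - y)\; w_2 \; \mathbf{1}[h > 0] \; x
```

The factor h_relu in the first line is why h_relu appears in grad_w2: it is the input that w2 multiplies in the forward pass, not a second relu.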
Something to notice is that since w2 is a leaf tensor, we only use dy/dw2 (aka grad_w2) during computation of dL/dw2 since it isn't part of the path from L to w1.
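One way to convince yourself that the expressions for grad_w1 and grad_w2 are right is to compare them against finite differences. The following is a minimal NumPy sketch of the same two-layer network; the small sizes and the helper numeric_grad are made up for illustration, and NumPy stands in for PyTorch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    h_relu = np.maximum(x @ w1, 0)
    return np.sum((h_relu @ w2 - y) ** 2)

# Manual backprop, mirroring the PyTorch example line by line.
h = x @ w1
h_relu = np.maximum(h, 0)
y_pred = h_relu @ w2
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T @ grad_y_pred   # uses the input of w2, which is h_relu
grad_h_relu = grad_y_pred @ w2.T
grad_h = grad_h_relu * (h > 0)     # relu passes gradient only where h > 0
grad_w1 = x.T @ grad_h             # uses the input of w1, which is x

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. each entry of w."""
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)
print(np.allclose(grad_w1, num_w1, rtol=1e-4, atol=1e-5),
      np.allclose(grad_w2, num_w2, rtol=1e-4, atol=1e-5))
```

The pattern generalizes to a deeper network: the gradient of each weight matrix is the transposed input of that layer times the gradient flowing back into that layer's output.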

How to map input image with neurons in first conv layer in CNN?

I just completed an ANN course and started learning CNNs. I have a basic understanding of how padding and stride operations work in a CNN.
However, I have difficulty mapping the input image to the neurons in the first conv layer, although I have a basic understanding of how input features are mapped to the first hidden layer in an ANN.
What is the best way to understand the mapping between the input image and the neurons in the first conv layer?
How can I clarify my doubts about the code example below? The code is taken from a DL course on Coursera.
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.contrib, tf.Session)

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """
    tf.set_random_seed(1)  # so that your "random" numbers match ours
    ### START CODE HERE ### (approx. 2 lines of code)
    W1 = tf.get_variable("W1", [4, 4, 3, 8], initializer=tf.contrib.layers.xavier_initializer(seed=0))
    W2 = tf.get_variable("W2", [2, 2, 8, 16], initializer=tf.contrib.layers.xavier_initializer(seed=0))
    parameters = {"W1": W1,
                  "W2": W2}
    return parameters

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters
    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    # Retrieve the parameters from the dictionary "parameters"
    W1 = parameters['W1']
    W2 = parameters['W2']
    # CONV2D: stride of 1, padding 'SAME'
    Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME')
    A1 = tf.nn.relu(Z1)
    # MAXPOOL: window 8x8, stride 8, padding 'SAME'
    P1 = tf.nn.max_pool(A1, ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1], padding='SAME')
    # CONV2D: filters W2, stride 1, padding 'SAME'
    Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding='SAME')
    A2 = tf.nn.relu(Z2)
    # MAXPOOL: window 4x4, stride 4, padding 'SAME'
    P2 = tf.nn.max_pool(A2, ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1], padding='SAME')
    P2 = tf.contrib.layers.flatten(P2)
    # FULLY-CONNECTED without non-linear activation function (do not call softmax).
    # 6 neurons in output layer. Hint: one of the arguments should be "activation_fn=None"
    Z3 = tf.contrib.layers.fully_connected(P2, 6, activation_fn=None)
    return Z3

# create_placeholders is defined elsewhere in the course notebook
with tf.Session() as sess:
    X, Y = create_placeholders(64, 64, 3, 6)
    parameters = initialize_parameters()
    Z3 = forward_propagation(X, parameters)
    init = tf.global_variables_initializer()
    sess.run(init)
    a = sess.run(Z3, {X: np.random.randn(1, 64, 64, 3), Y: np.random.randn(1, 6)})
    print("Z3 = " + str(a))
How is this input image of size 64*64*3 processed by 8 filters, each of size 4*4*3?
stride = 1, padding = same and batch_size = 1.
What I have understood so far is that each neuron in the first conv layer has 8 filters, each of size 4*4*3. Each neuron in the first convolution layer takes a portion of the input image of the same size as the filter (here 4*4*3), applies the convolution operation, and the layer produces eight 64*64 feature maps.
If my understanding is correct then:
1> Why do we need the striding operation, since the kernel size and the portion of the input image processed by each neuron are the same? If we apply stride = 1 (or 2), the boundary of a portion of the input image is crossed, which is something we don't need, right?
2> How do we know which portion of the input image (same as the kernel size) is mapped to which neuron in the first conv layer?
If not then:
3> How is the input image passed on to the neurons in the first convolution layer? Is the complete input image passed on to each neuron (like in a fully connected ANN, where all the input features are mapped to each neuron in the first hidden layer)?
Or only a portion of the input image? How do we know which portion of the input image is mapped to which neuron in the first conv layer?
4> Is the number of kernels specified in the example above (W1 = [4, 4, 3, 8]) per neuron, or the total number of kernels in the first conv layer?
5> How do we know how many neurons are used by the example above in the first convolution layer?
6> Is there any relationship between the number of neurons and the number of kernels in the first conv layer?
I found relevant answers to my questions and am posting them here.
First of all, the concept of a neuron exists in a conv layer as well, but indirectly. Basically, each neuron in a conv layer deals with a portion of the input image that is the same size as the kernel used in that conv layer.
Each neuron focuses only on a particular portion of the input image (whereas in a fully connected ANN each neuron looks at the whole image), and each neuron uses n filters/kernels to get more insight into that particular portion of the image.
These n filters/kernels are shared by all the neurons in a given conv layer. Because of this weight (kernel/filter) sharing, a conv layer has fewer parameters to learn, whereas in a fully connected ANN each neuron has its own weights and hence there are more parameters to learn.
Now, the number of neurons in a given conv layer L depends on the input size (output of the previous layer L-1), the kernel size used in layer L, the padding used in layer L and the stride used in layer L.
Now let us answer each of the questions specified above.
1> How do we know which portion of the input image (same as the kernel size) is mapped to which neuron in the first conv layer?
From above code example for conv layer 1:
Batch size = 1
Input image size = 64*64*3
Kernel size = 4*4*3 ==> Taken from W1
Number of kernel = 8 ==> Taken from W1
Padding = same
stride = 1
Stride = 1 means that you are sliding the kernel one pixel at a time. Let's consider the x axis and number the pixels 1, 2, 3, 4, ..., 64.
With valid padding, the first neuron will see pixels 1, 2, 3 and 4; then the kernel is shifted by one pixel and the next neuron will see pixels 2, 3, 4 and 5; the last neuron will see pixels 61, 62, 63 and 64.
With same padding, the first neuron will see pixels 0 (zero padded), 1, 2 and 3, the second neuron will see pixels 1, 2, 3 and 4, and the last neuron will see pixels 63, 64 and two zero-padded positions (when the total padding is odd, TensorFlow puts the extra padding at the end).
With same padding, you end up with an output of the same spatial size as the image (64 x 64 x 8). With valid padding, the output is (61 x 61 x 8).
The 8 in the output is the number of filters.
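The mapping described above can be computed directly. The following is a small sketch, assuming TensorFlow's documented 'SAME' rule (total padding split with the extra element at the end); the function name receptive_field and its 1-based pixel numbering are choices made here for illustration:

```python
def receptive_field(out_index, in_size=64, kernel=4, stride=1, padding="SAME"):
    """Return (output size, 1-based input positions seen by output neuron out_index).

    out_index is 0-based. Positions below 1 or above in_size are zero padding.
    """
    if padding == "SAME":
        out_size = -(-in_size // stride)  # ceil(in_size / stride)
        pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
        pad_before = pad_total // 2       # the remainder goes at the end
    else:  # "VALID"
        out_size = (in_size - kernel) // stride + 1
        pad_before = 0
    start = out_index * stride - pad_before + 1
    return out_size, list(range(start, start + kernel))

print(receptive_field(0, padding="VALID"))
print(receptive_field(60, padding="VALID"))
print(receptive_field(0, padding="SAME"))
print(receptive_field(63, padding="SAME"))
```

The same function applied along the y axis gives the rows, so each output neuron at (h, w) sees the corresponding 4 x 4 window across all 3 input channels.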
2> How is the input image passed on to the neurons in the first convolution layer? Is the complete input image passed on to each neuron (like in a fully connected ANN, where all the input features are mapped to each neuron in the first hidden layer)?
Each neuron looks at only a portion of the input image. Please refer to the answer to the first question for the mapping between the input image and the neurons.
3> Is the number of kernels specified in the example above (W1 = [4, 4, 3, 8]) per neuron, or the total number of kernels in the first conv layer?
It is the total number of kernels for that layer, and all the neurons in that layer share the same kernels when looking at different portions of the input image. Hence the number of parameters to learn in a convnet is smaller than in a fully connected ANN.
4> How do we know how many neurons are used by the example above in the first convolution layer?
It depends on the input size (output of the previous layer L-1), the kernel size used in layer L, the padding used in layer L and the stride used in layer L. Please refer to the answer to the first question above for more clarification.
5> Is there any relationship between the number of neurons and the number of kernels in the first conv layer?
There is no relationship in terms of numbers, but each neuron uses n filters/kernels (these kernels are shared among all the neurons in a particular layer) to learn more about a particular portion of the input image.
The sample code below helps clarify the internal implementation of the convolution operation.
import numpy as np

# zero_pad and conv_single_step come from the course notebook;
# the two definitions below are minimal stand-ins so the example runs.
def zero_pad(X, pad):
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode='constant')

def conv_single_step(a_slice_prev, W, b):
    return np.sum(a_slice_prev * W) + b.item()

def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function
    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"
    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    """
    # Retrieve dimensions from A_prev's shape (≈1 line)
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    # Retrieve dimensions from W's shape (≈1 line)
    (f, f, n_C_prev, n_C) = W.shape
    # Retrieve information from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    pad = hparameters['pad']
    # Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
    n_H = int(np.floor((n_H_prev - f + 2 * pad) / stride)) + 1
    n_W = int(np.floor((n_W_prev - f + 2 * pad) / stride)) + 1
    # Initialize the output volume Z with zeros. (≈1 line)
    Z = np.zeros((m, n_H, n_W, n_C))
    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)
    for i in range(m):                  # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]      # Select ith training example's padded activation
        for h in range(n_H):            # loop over vertical axis of the output volume
            for w in range(n_W):        # loop over horizontal axis of the output volume
                for c in range(n_C):    # loop over channels (= #filters) of the output volume
                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    # Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
                    Z[i, h, w, c] = conv_single_step(a_slice_prev, W[:, :, :, c], b[:, :, :, c])
    return Z

A_prev = np.random.randn(1, 64, 64, 3)
W = np.random.randn(4, 4, 3, 8)
# Don't worry about bias, tensorflow will take care of this.
b = np.random.randn(1, 1, 1, 8)
hparameters = {"pad": 1,
               "stride": 1}
Z = conv_forward(A_prev, W, b, hparameters)

How does a 1D multi-channel convolutional layer (Keras) train?

I am working with time series EEG data recorded from 10 individual locations on the body to classify future behavior in terms of increasing heart activity. I would like to better understand how my labeled data corresponds to the training inputs.
So far, several RNN configurations as well as countless combinations of vanilla dense networks have not gotten me great results, so I figured a 1D convnet is worth a try.
The things I'm having trouble understanding are:
1.) Feeding data into the model.
orig shape = (30000 timesteps, 10 channels)
array fed to layer = (300 slices, 100 timesteps, 10 channels)
Are the slices separated by 1 time step, giving me 300 slices of timesteps at either end of the original array, or are they separated end to end? If the second is true, how could I create an array of (30000 - 100) slices separated by one ts and is also compatible with the 1D CNN layer?
2) Matching labels with the training and testing data
My understanding is that when you feed in a sequence of train_x_shape = (30000, 10), there are 30000 labels with train_y_shape = (30000, 2) (2 classes) associated with the train_x data.
So, when (300 slices of) 100 timesteps of train_x data with shape = (300, 100, 10) are fed into the model, does the label value correspond to the entire 100 ts (one label per 100 ts, with this label being equal to the last time step's label), or is each of the 100 rows/vectors in the slice labeled, one label per ts?
Train input:
train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
n_timesteps = 100
n_channels = 10
layer : model.add(Convolution1D(filters = n_channels * 2, padding = 'same', kernel_size = 3, input_shape = (n_timesteps, n_channels)))
final layer : model.add(Dense(2, activation = 'softmax'))
I use categorical_crossentropy for loss.
Answer 1
This will really depend on "how did you get those slices"?
The answer is totally dependent on what "you're doing". So, what do you want?
If you have simply reshaped (array.reshape(...)) the original array from shape (30000,10) to shape (300,100,10), the model will see:
300 individual (and not connected) sequences
100 timesteps in each sequence
Sequence 1 goes from step 0 to 99;
Sequence 2 goes from step 100 to 199 and so on.
Creating overlapping slices - Sliding window
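If you want to create sequences shifted by only one timestep, make a loop for that.
import numpy as np
originalSequence = someArrayWithShape((30000,10))
newSlices = [] #empty list
start = 0
end = start + 100  # window length: 100 timesteps per slice
while end <= 30000:
    newSlices.append(originalSequence[start:end])
    start += 1
    end += 1
newSlices = np.asarray(newSlices)  # shape (29901, 100, 10)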
Beware: if you do this in the input data, you will have to do a similar thing in your output data as well.
Again, that's totally up to you. What do you want to achieve?
Convolutional layers will keep the timesteps with these options:
If you use padding='same', the final length will be the same as the input
If you don't, the final length will be reduced depending on the kernel size you choose
Recurrent layers will keep the timesteps or not depending on:
Whether you use return_sequences=True - Output has timesteps
Or you use return_sequences=False - Output has no timesteps
If you want only one output for each sequence (not per timestep):
Recurrent models:
Use LSTM(...., return_sequences=True) until the last LSTM
The last LSTM will be LSTM(..., return_sequences=False)
Convolutional models:
At some point after the convolutions, choose one of these to add:
GlobalMaxPooling1D
Flatten (but treat the number of channels later with a Dense(2))
I think I'd go with GlobalMaxPooling1D if using convolutions, but recurrent models seem better for this. (Not a rule, though).
You can choose to use intermediate MaxPooling1D layers to gradually reduce the length from 100 to 50, then to 25 and so on. This will probably reach a better output.
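Remember to keep X and Y paired:
import numpy as np
train_x = someArrayWithShape((30000,10))
train_y = someArrayWithShape((30000,2))
newXSlices = [] #empty list
newYSlices = [] #empty list
start = 0
end = start + 100  # window length: 100 timesteps per slice
while end <= 30000:
    newXSlices.append(train_x[start:end])
    # assumption: one label per window, taken from its last timestep
    newYSlices.append(train_y[end - 1])
    start += 1
    end += 1
newXSlices = np.asarray(newXSlices)  # shape (29901, 100, 10)
newYSlices = np.asarray(newYSlices)  # shape (29901, 2)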
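Paired sliding windows can also be built without a Python loop. The sketch below uses numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later); the helper name make_windows, the toy data and the choice of labeling each window with its last timestep are assumptions made for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(x, y, window):
    """x: (timesteps, channels), y: (timesteps, classes) -> overlapping windows."""
    xs = sliding_window_view(x, window, axis=0)  # (n, channels, window), read-only view
    xs = np.transpose(xs, (0, 2, 1))             # (n, window, channels) for Conv1D
    ys = y[window - 1:]                          # label of the last timestep of each window
    return xs, ys

# Toy data standing in for the (30000, 10) and (30000, 2) arrays.
train_x = np.arange(30 * 2, dtype=float).reshape(30, 2)
train_y = np.eye(2)[np.arange(30) % 2]

xs, ys = make_windows(train_x, train_y, window=5)
print(xs.shape, ys.shape)
```

Because the result is a view on the original array, it uses no extra memory until something copies it, which matters when 29901 windows of 100 timesteps are created from 30000 rows.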