In pytorch, how can I sum some elements, and get a tensor of smaller shape? - sum

Specifically I have a tensor of dimension 298x160x160 (faces in 298 frames), I need to sum every 4x4 element in last two dimesnion so that I can get a 298x40x40 tensor.
How can I achieve that?

You could create a Convolutional layer with a single 4x4 channel and set its weights to 1, with a stride of 4 (also see Conv2D doc):
a = torch.ones((298,160,160))
# add a dimension for the channels. Conv2D expects the input to be : (N,C,H,W)
# where N=number of samples, C=number of channels, H=height, W=width
a = a.unsqueeze(1)
a.shape
Out: torch.Size([298, 1, 160, 160])
with torch.no_grad(): # I assume you don't need to backprop, otherwise remove this check
m = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=4,stride=4,bias=False)
# set the kernel values to 1
m.weight.data = m.weight.data * 0. + 1.
# apply the kernel and squeeze the channel dim out again
res = m(a).squeeze()
res.shape
Out: torch.Size([298, 40, 40])

Related

Two input layers for LSTM Neural Network?

I am now building a neural network, and I am facing the task of adding another input layer (since now I just needed one).
In particular, this was the code previously:
###...
if(self.net_embedding==0):
l_input = Input(shape=self.win_size, dtype='int32', name='input_act')
emb_input = Embedding(output_dim=params["output_dim_embedding"], input_dim=unique_events + 1, input_length=self.win_size)(l_input)
toBePassed=emb_input
elif(self.net_embedding==1):
self.getWord2VecEmbeddings(params['word2vec_size'])
X_train=self.encodePrefixes(params['word2vec_size'],X_train)
l_input = Input(shape = (self.win_size, params['word2vec_size']), name = 'input_act')
toBePassed=l_input
l1 = LSTM(params["shared_lstm_size"],return_sequences=True, kernel_initializer='glorot_uniform',dropout=params['dropout'])(toBePassed)
l1 = BatchNormalization()(l1)
#and so on with the rest of the layers...
The input of the model (X_train) was just an array of arrays (with size = self.win_size) of integers (e.g. [[0 1 2 3] [1 2 3 4]...] if self.win_size = 4), where the integers represent categorical elements.
As you can see, I also have two types of embeddings for this input:
Embedding layer
Word2Vec encoding
Now, I need to add another input to the net, which is as well an array of arrays (with size = self.win_size again) of integers (eg. [[0 123 334 2212][123 334 2212 4888]...], but this time I don't need to apply any embedding (I think) because the elements here are not categorical (they represent elapsed time in seconds).
I tried by simply changing the net to:
#...
if(self.net_embedding==0):
l_input = Input(shape=self.win_size, dtype='int32', name='input_act')
emb_input = Embedding(output_dim=params["output_dim_embedding"], input_dim=unique_events + 1, input_length=self.win_size)(l_input)
toBePassed=emb_input
elif(self.net_embedding==1):
self.getWord2VecEmbeddings(params['word2vec_size'])
X_train=self.encodePrefixes(params['word2vec_size'],X_train)
l_input = Input(shape = (self.win_size, params['word2vec_size']), name = 'input_act')
toBePassed=l_input
elapsed_time_input = Input(shape=self.win_size, name='input_time')
input_concat = Concatenate(axis=1)([toBePassed, elapsed_time_input])
l1 = LSTM(params["shared_lstm_size"],return_sequences=True, kernel_initializer='glorot_uniform',dropout=params['dropout'])(input_concat)
l1 = BatchNormalization()(l1)
#and so on with other layers...
but I get the error:
ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 4, 12), (None, 4)]
Do you please have any solution for this? Any kind of help would be really appreciated, since I have a deadline in a few days and I'm smashing my head on this for so long now! Thanks :)
There are two problems with your approach.
First, inputs to LSTM should have a shape of (batch_size, num_steps, num_feats), yet your elapsed_time_input has shape (None, 4). You need to expand its dimension to get the proper shape (None, 4, 1).
elapsed_time_input = tf.keras.layers.Reshape((-1, 1))(elapsed_time_input)
or
elapsed_time_input = tf.expand_dims(elapsed_time_input, axis=-1)
With this, "elapsed time in seconds" will be seen as just another feature of a timestep.
Secondly, you'll want to concatenate the two inputs in the feature dimension (not the timestep dimension).
input_concat = Concatenate(axis=-1)([toBePassed, elapsed_time_input])
or
input_concat = Concatenate(axis=2)([toBePassed, elapsed_time_input])
After this, you'll get a keras tensor with a shape of (None, 4, 13). It represents a batch of time series, each having 4 timesteps and 13 features per step (12 original features + elapsed time in second for each step).

How to map input image with neurons in first conv layer in CNN?

I just completed ANN course and started learning CNN. I have basic understanding of padding and stride operation works in CNN.
But have difficultly in mapping input image with neurons in first conv layer but i have basic
understanding of how input features are mapped to first hidden layer in ANN.
What is best way of understanding mapping between input image with neurons in first conv layer?
How can I clarify my doubts about the below code example? Code is taken from DL course in Coursera.
def initialize_parameters():
"""
Initializes weight parameters to build a neural network with tensorflow. The shapes are:
W1 : [4, 4, 3, 8]
W2 : [2, 2, 8, 16]
Returns:
parameters -- a dictionary of tensors containing W1, W2
"""
tf.set_random_seed(1) # so that your "random" numbers match ours
### START CODE HERE ### (approx. 2 lines of code)
W1 = tf.get_variable("W1",[4,4,3,8],initializer = tf.contrib.layers.xavier_initializer(seed = 0))
W2 = tf.get_variable("W2",[2,2,8,16],initializer = tf.contrib.layers.xavier_initializer(seed = 0))
### END CODE HERE ###
parameters = {"W1": W1,
"W2": W2}
return parameters
def forward_propagation(X, parameters):
"""
Implements the forward propagation for the model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
Arguments:
X -- input dataset placeholder, of shape (input size, number of examples)
parameters -- python dictionary containing your parameters "W1", "W2"
the shapes are given in initialize_parameters
Returns:
Z3 -- the output of the last LINEAR unit
"""
# Retrieve the parameters from the dictionary "parameters"
W1 = parameters['W1']
W2 = parameters['W2']
### START CODE HERE ###
# CONV2D: stride of 1, padding 'SAME'
Z1 = tf.nn.conv2d(X,W1, strides = [1,1,1,1], padding = 'SAME')
# RELU
A1 = tf.nn.relu(Z1)
# MAXPOOL: window 8x8, sride 8, padding 'SAME'
P1 = tf.nn.max_pool(A1, ksize = [1,8,8,1], strides = [1,8,8,1], padding = 'SAME')
# CONV2D: filters W2, stride 1, padding 'SAME'
Z2 = tf.nn.conv2d(P1,W2, strides = [1,1,1,1], padding = 'SAME')
# RELU
A2 = tf.nn.relu(Z2)
# MAXPOOL: window 4x4, stride 4, padding 'SAME'
P2 = tf.nn.max_pool(A2, ksize = [1,4,4,1], strides = [1,4,4,1], padding = 'SAME')
# FLATTEN
P2 = tf.contrib.layers.flatten(P2)
# FULLY-CONNECTED without non-linear activation function (not not call softmax).
# 6 neurons in output layer. Hint: one of the arguments should be "activation_fn=None"
Z3 = tf.contrib.layers.fully_connected(P2, 6,activation_fn=None)
### END CODE HERE ###
return Z3
with tf.Session() as sess:
np.random.seed(1)
X, Y = create_placeholders(64, 64, 3, 6)
parameters = initialize_parameters()
Z3 = forward_propagation(X, parameters)
init = tf.global_variables_initializer()
sess.run(init)
a = sess.run(Z3, {X: np.random.randn(1,64,64,3), Y: np.random.randn(1,6)})
print("Z3 = " + str(a))
How is this input image of size 64*64*3 is processed by 8 filter of each size 4*4*3?
stride = 1, padding = same and batch_size = 1.
What I have understood till now is each neuron in first conv layer will have 8 filters and each of them having size 4*4*3. Each neuron in first convolution layer will take portion of the input image which is same as filter size (which is here 4*4*3) and apply the convolution operation and produces eight 64*64 features mapping.
If my understanding is correct then:
1> Why we need striding operation since kernel size and portion input image proceed by each neuron is same, If we apply stride = 1(or 2) then boundary of portion of input image is cross which is something we don't need right ?
2> How do we know which portion of input image (same as kernel size) is mapped which neuron in first conv layer?
If not then:
3> How input image is passed on neurons in first convolution layer, Is is complete input image is passed on to each neuron (Like in fully connected ANN, where all the input features are mapped to each neuron in first hidden layer)?
Or portion of input image ? How do we know which portion of input image is mapped which neuron in first conv layer?
4> Number of kernel specified above example (W1= [4, 4, 3, 8]) is per neuron or total number of kernel in fist conv layer ?
5> how do we know how may neurons used by above example in first convolution layer.
6> Is there any relationship between number of neurons and number of kernel first conv layer.
I found relevant answers to my questions and posting same here.
First of all concept of neuron is exist in conv layer as well but it's indirectly. Basically each neuron in conv layer deals with portion of input image which is same as the size of the kernel used in that conv layer.
Each neuron will focus on only particular portion of input image (Where in fully-connected ANN each neuron focus on whole image) and each neuron use n number of filters/kernels to get more insight of particular portion of image.
These n filters/kernels shared by all the neurons in given conv layer. Because of these weight(kernel/filter) sharing nature conv layer will have less number of parameter to learn. Where as in fully connected ANN network each neuron as it's own weight matrix and hence number of parameter to learn is more.
Now the number of neurons in given conv layer 'L' is depends on input_size (output of previous layer L-1), Kernel_size used in layer L , Padding used in layer L and Stride used in layer L.
Now let answer each of the questions specified above.
1> How do we know which portion of input image (same as kernel size) is mapped which neuron in first conv layer?
From above code example for conv layer 1:
Batch size = 1
Input image size = 64*64*3
Kernel size = 4*4*3 ==> Taken from W1
Number of kernel = 8 ==> Taken from W1
Padding = same
stride = 1
Stride = 1 means that you are sliding the kernel one pixel at a time. Let's consider x axis and number pixels 1, 2, 3 4 ... and 64.
The first neuron will see pixels 1 2,3 and 4, then the kernel is shifted by one pixel and the next neuron will see pixels 2 3, 4 and 5 and last neuron will see pixels 61, 62, 63 and 64 This happens if you use valid padding.
In case of same padding, first neuron will see pixels 0, 1, 2, and 3, the second neuron will see pixels 1, 2, 3 and 4, the last neuron will see pixels 62,63, 64 and (one zero padded).
In case the same padding case, you end up with the output of the same size as the image (64 x 64 x 8). In the case of valid padding, the output is (61 x 61 x 8).
Where 8 in output represent the number of filters.
2> How input image is passed on neurons in first convolution layer, Is is complete input image is passed on to each neuron (Like in fully connected ANN, where all the input features are mapped to each neuron in first hidden layer)?
Neurons looks for only portion of input image, Please refer the first question answer you will be able map between input image and neuron.
3> Number of kernel specified above example (W1= [4, 4, 3, 8]) is per neuron or total number of kernel in fist conv layer ?
It's total number of kernels for that layer and all the neuron i that layer will share same kernel for learning different portion of input image. Hence in convnet number of parameter to be learn is less compare to fully-connected ANN.
4> How do we know how may neurons used by above example in first convolution layer ?
It depends on input_size (output of previous layer L-1), Kernel_size used in layer L , Padding used in layer L and Stride used in layer L. Please refer first question answer above for more clarification.
5> Is there any relationship between number of neurons and number of kernel first conv layer
There is no relationship with respect numbers, But each neuron uses n number of filters/kernel (these kernel are shared among all the neurons in particular layer)to learn more about particular portion of input image.
Below sample code will help us clarify the internal implementation of convolution operation.
def conv_forward(A_prev, W, b, hparameters):
"""
Implements the forward propagation for a convolution function
Arguments:
A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
b -- Biases, numpy array of shape (1, 1, 1, n_C)
hparameters -- python dictionary containing "stride" and "pad"
Returns:
Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
cache -- cache of values needed for the conv_backward() function
"""
# Retrieve dimensions from A_prev's shape (≈1 line)
(m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
# Retrieve dimensions from W's shape (≈1 line)
(f, f, n_C_prev, n_C) = W.shape
# Retrieve information from "hparameters" (≈2 lines)
stride = hparameters['stride']
pad = hparameters['pad']
# Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
n_H = int(np.floor((n_H_prev-f+2*pad)/stride)) + 1
n_W = int(np.floor((n_W_prev-f+2*pad)/stride)) + 1
# Initialize the output volume Z with zeros. (≈1 line)
Z = np.zeros((m,n_H,n_W,n_C))
# Create A_prev_pad by padding A_prev
A_prev_pad = zero_pad(A_prev,pad)
for i in range(m): # loop over the batch of training examples
a_prev_pad = A_prev_pad[i] # Select ith training example's padded activation
for h in range(n_H): # loop over vertical axis of the output volume
for w in range(n_W): # loop over horizontal axis of the output volume
for c in range(n_C): # loop over channels (= #filters) of the output volume
# Find the corners of the current "slice" (≈4 lines)
vert_start = h*stride
vert_end = vert_start+f
horiz_start = w*stride
horiz_end = horiz_start+f
# Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
a_slice_prev = a_prev_pad[vert_start:vert_end,horiz_start:horiz_end,:]
# Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
Z[i, h, w, c] = conv_single_step(a_slice_prev,W[:,:,:,c],b[:,:,:,c])
return Z
A_prev = np.random.randn(1,64,64,3)
W = np.random.randn(4,4,3,8)
#Don't worry about bias , tensorflow will take care of this.
b = np.random.randn(1,1,1,8)
hparameters = {"pad" : 1,
"stride": 1}
Z = conv_forward(A_prev, W, b, hparameters)

Hidden state tensors have a different order than the returned tensors

As part of GRU training, I want to retrieve the hidden state tensors.
I have defined a GRU with two layers:
self.lstm = nn.GRU(params.vid_embedding_dim, params.hidden_dim , 2)
The forward function is defined as follows (the following is just a part of the implementation):
def forward(self, s, order, batch_size, where, anchor_is_phrase = False):
"""
Forward prop.
"""
# s is of shape [128 , 1 , 300] , 128 is batch size
output, (a,b) = self.lstm(s.cuda())
output.data.contiguous()
And out is of shape: [128 , 400] (128 is the number of samples which each one is embedded in 400 dimensional vector).
I understand that out is the output of the last hidden state and thus I expect it to be equal to b. However, after I checked the values I saw that it's indeed equal but b contains the tensor in a different order, that is for example output[0] is b[49]. Am I missing something here ?
Thanks.
I understand your confusion. Have a look the example bellow and the comments:
# [Batch size, Sequence length, Embedding size]
inputs = torch.rand(128, 5, 300)
gru = nn.GRU(input_size=300, hidden_size=400, num_layers=2, batch_first=True)
with torch.no_grad():
# output is all hidden states, for each element in the batch of the last layer in the RNN
# a is the last hidden state of the first layer
# b is the last hidden state of the second (last) layer
output, (a, b) = gru(inputs)
If we print out the shapes, they will confirm our understanding:
print(output.shape) # torch.Size([128, 5, 400])
print(a.shape) # torch.Size([128, 400])
print(b.shape) # torch.Size([128, 400])
Also, we can test whether the last hidden state, for each element in the batch, of the last layer, obtained from output is equal to b:
np.testing.assert_almost_equal(b.numpy(), output[:,:-1,:].numpy())
Finally, we can create an RNN with 3 layers, and run the same tests:
gru = nn.GRU(input_size=300, hidden_size=400, num_layers=3, batch_first=True)
with torch.no_grad():
output, (a, b, c) = gru(inputs)
np.testing.assert_almost_equal(c.numpy(), output[:,-1,:].numpy())
Again, the assertion passes but only if we do it for c, which is now the last layer of the RNN. Otherwise:
np.testing.assert_almost_equal(b.numpy(), output[:,-1,:].numpy())
Raises an error:
AssertionError: Arrays are not almost equal to 7 decimals
I hope that this makes things clear for you.

Cleaner way to whiten each image in a batch using keras

I would like to whiten each image in a batch. The code I have to do so is this:
def whiten(self, x):
shape = x.shape
x = K.batch_flatten(x)
mn = K.mean(x, 0)
std = K.std(x, 0) + K.epsilon()
r = (x - mn) / std
r = K.reshape(x, (-1,shape[1],shape[2],shape[3]))
return r
#
where x is (?, 320,320,1). I am not keen on the reshape function with a -1 arg. Is there a cleaner way to do this?
Let's see what the -1 does. From the Tensorflow documentation (Because the documentation from Keras is scarce compared to the one from Tensorflow):
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant.
So what this means:
from keras import backend as K
X = tf.constant([1,2,3,4,5])
K.reshape(X, [-1, 5])
# Add one more dimension, the number of columns should be 5, and keep the number of elements to be constant
# [[1 2 3 4 5]]
X = tf.constant([1,2,3,4,5,6])
K.reshape(X, [-1, 3])
# Add one more dimension, the number of columns should be 3
# For the number of elements to be constant the number of rows should be 2
# [[1 2 3]
# [4 5 6]]
I think it is simple enough. So what happens in your code:
# Let's assume we have 5 images, 320x320 with 3 channels
X = tf.ones((5, 320, 320, 3))
shape = X.shape
# Let's flat the tensor so we can perform the rest of the computation
flatten = K.batch_flatten(X)
# What this did is: Turn a nD tensor into a 2D tensor with same 0th dimension. (Taken from the documentation directly, let's see that below)
flatten.shape
# (5, 307200)
# So all the other elements were squeezed in 1 dimension while keeping the batch_size the same
# ...The rest of the stuff in your code is executed here...
# So we did all we wanted and now we want to revert the tensor in the shape it had previously
r = K.reshape(flatten, (-1, shape[1],shape[2],shape[3]))
r.shape
# (5, 320, 320, 3)
Besides, I can't think of a cleaner way to do what you want to do. If you ask me, your code is already clear enough.

feeding data into an LSTM - Tensorflow RNN PTB tutorial

I am trying to feed data into an LSTM. I am reviewing the code from Tensorflow's RNN tutorial here.
The segment of code of interest is from the reader.py file of the tutorial, in particular the ptb_producer function, which ouputs the X and Y that is used by the LSTM.
raw_data is a list of indexes of words from the ptb.train.txt file. The length of raw_data is 929,589. batch_size is 20, num_steps is 35. Both batch_size and num_steps are based on the LARGEconfig which feeds the data to an LSTM.
I have walked through the code (and added comments for what I've printed) and I understand it up till tf.strided_slice. From the reshape, we have a matrix of indexes of shape (20, 46497).
Strided slice in the first iteration of i, tries to take data from [0, i * num_steps + 1] which is [0,1*35+1] till [batch_size, (i + 1) * num_steps + 1] which is [20, (1+1)*35+1].
Two questions:
where in the matrix is [0,1*35+1] and [20, (1+1)*35+1]? What spots in the (20, 46497) the begin and end in strided_slice is trying to access?
It seems like EVERY iteration of i, will take in data from 0? the very start of the data matrix (20, 46497)?
I guess what I am not understanding is how you would feed data into an LSTM, given the batch size and the num_steps (sequence length).
I have read colahs blog on LSTM and Karpathy's blog on RNN which helps greatly in the understanding of LSTMs, but don't seem to address the exact mechanics of getting data into an LSTM. (maybe I missed something?)
def ptb_producer(raw_data, batch_size, num_steps, name=None):
"""Iterate on the raw PTB data.
This chunks up raw_data into batches of examples and returns Tensors that
are drawn from these batches.
Args:
raw_data: one of the raw data outputs from ptb_raw_data.
batch_size: int, the batch size.
num_steps: int, the number of unrolls.
name: the name of this operation (optional).
Returns:
A pair of Tensors, each shaped [batch_size, num_steps]. The second
element of the tuple is the same data time-shifted to the right by one.
Raises:
tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
"""
with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
data_len = tf.size(raw_data) # prints 929,589
batch_len = data_len // batch_size # prints 46,497
data = tf.reshape(raw_data[0 : batch_size * batch_len],
[batch_size, batch_len])
#this truncates raw data to a multiple of batch_size=20,
#then reshapes to [20, 46497]. prints (20,?)
epoch_size = (batch_len - 1) // num_steps #prints 1327 (number of epoches)
assertion = tf.assert_positive(
epoch_size,
message="epoch_size == 0, decrease batch_size or num_steps")
with tf.control_dependencies([assertion]):
epoch_size = tf.identity(epoch_size, name="epoch_size")
i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
#for each of the 1327 epoches
x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps]) # prints (?, ?)
x.set_shape([batch_size, num_steps]) #prints (20,35)
y = tf.strided_slice(data, [0, i * num_steps + 1], [batch_size, (i + 1) * num_steps + 1])
y.set_shape([batch_size, num_steps])
return x, y
where in the matrix is [0,1*35+1] and [20, (1+1)*35+1]? What spots in the (20, 46497) the begin and end in strided_slice is trying to access?
So the data matrix is arranged as 20 x 46497. Think of it as 20 sentences with 46497 characters each. The strided slice in your specific example will get character 36 to 70 (71 is not included, typical range semantics) from each line. This is equivalent to
strided = [data[i][36:71] for i in range(20)]
It seems like EVERY iteration of i, will take in data from 0? the very start of the data matrix (20, 46497)?
This is correct. For each iteration we want batch size elements in first dimension i.e. we want a matrix of size batch size x num_steps. Hence the first dimension always goes from 0-19, essentially looping over each sentence. And second dimension increments in num_steps, getting segments of words from each sentence. If we look at the index it extracts from data at each step for x, we will get something like:
- data[0:20,0:35]
- data[0:20,35:70]
- data[0:20,70:105]
- ...
For y we add 1 because we want next character to be predicted.
So your train, label tuples will be like:
- (data[0:20,0:35], data[0:20, 1:36])
- (data[0:20,35:70], data[0:20,36:71])
- (data[0:20,70:105], data[0:20,71:106])
- ...