tensorflow: conv2d_transpose: Matching desired output dimensions

How can I force a certain output dimensionality from the conv2d_transpose layer? My problem is that I use it for upsampling and I want the output of the NN to match the dimensionality of my labels. For example, if I have a feature map of shape Bx25x40xC, how can I make it Bx100x160xC (i.e. upsample it exactly 4x)?
It seems like dimensions of the output can be calculated using
h = ((h_in - 1) * stride_h) + kernel_h - 2 * pad_h
w = ((w_in - 1) * stride_w) + kernel_w - 2 * pad_w
One can manipulate strides and kernels, but padding is controlled by the 'SAME'/'VALID' algorithms, which, to my understanding, makes the padding, and therefore the resulting output size, pretty much uncontrollable. For comparison, in Caffe one can at least force the padding explicitly in an attempt to match the desired output.
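To make the formula concrete, here is a small sketch (pure Python arithmetic with made-up kernel/stride/padding values, not a TensorFlow call) that evaluates it for the 25-pixel dimension:
def deconv_out(size_in, stride, kernel, pad):
    # output size of conv2d_transpose along one dimension, per the formula above
    return (size_in - 1) * stride + kernel - 2 * pad
print(deconv_out(25, 4, 4, 0))  # 100 -> kernel 4, stride 4, no padding hits the 4x target
print(deconv_out(25, 4, 5, 0))  # 101 -> kernel 5 overshoots unless the padding can be forced to 0.5 per side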

Related

What do these dimensions represent in a neural network?

I am following Andrew Ng's course on Deep Learning. One programming assignment uses the SIGN dataset. As far as I know, each image is 64 by 64 pixels in width and height, with another dimension of 3 that corresponds to the RGB channels.
According to the author, the value of
n_x = num_px * num_px * 3 = 64 * 64 * 3 = 12288
and having the following data:
number of training examples = 1080
number of test examples = 120
X_train shape: (12288, 1080)
Y_train shape: (6, 1080)
The part that I do not understand is when the author initializes the weights; he says that the shape of W1 (an array of weights) is:
W1 : [25, 12288]
This part I do not get: why 25 as the number of rows? I understand that the number of columns corresponds to the formula for n_x, but what does this 25 refer to? Is it the number of neurons inside a hidden layer?
Thanks
It looks like 12288 is the number of input nodes and 25 is the number of nodes in the hidden layer.
Thus, the number of weights should be 25 * 12288 (each node in layer i is connected to each node in layer i+1), which is exactly the size of the matrix.
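For illustration, a small numpy sketch (random data, shapes taken from the question) showing why the (25, 12288) shape is what makes the forward pass work:
import numpy as np
n_x, n_h, m = 12288, 25, 1080          # input size, hidden units, training examples
X_train = np.random.rand(n_x, m)       # (12288, 1080)
W1 = np.random.randn(n_h, n_x) * 0.01  # (25, 12288): one row of weights per hidden neuron
b1 = np.zeros((n_h, 1))
Z1 = W1 @ X_train + b1                 # (25, 1080): one 25-dimensional activation per example
print(Z1.shape)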

How to handle log(0) when using cross entropy

To make the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY)) #cross entropy
cost = -np.sum(loss)/m #num of examples in batch is m
Probability of Y
predY is computed using the sigmoid, and the logits can be thought of as the output of a neural network before reaching the classification step:
predY = sigmoid(logits)  # binary case
def sigmoid(X):
    return 1 / (1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5]: 3 is the number of examples and 5 is the feature size (fabricated data)
Num of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
Such an arrangement is set up to overfit. When it overfits, we can perfectly predict the probability for the training examples; in other words, the sigmoid outputs exactly 1 or 0 because the exponential term blows up or vanishes. In that case np.log(0) is undefined. How do you usually handle this issue?
If you don't mind the dependency on scipy, you can use scipy.special.xlogy. You would replace the expression
np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
with
xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
If you expect predY to contain very small values, you might get better numerical results using scipy.special.xlog1py in the second term:
xlogy(Y, predY) + xlog1py(1 - Y, -predY)
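For illustration, a tiny check (made-up Y and predY values) that xlogy evaluates 0 * log(0) as 0, which is exactly the case that breaks the naive expression when predY hits 0 or 1:
import numpy as np
from scipy.special import xlogy
Y = np.array([1.0, 0.0, 1.0])
predY = np.array([1.0, 0.0, 0.5])                 # perfectly confident on the first two examples
loss = xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
print(loss)                                       # [ 0.  0. -0.693...], no nan or -inf
cost = -loss.sum() / len(Y)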
Alternatively, knowing that the values in Y are either 0 or 1, you can compute the cost in an entirely different way:
Yis1 = Y == 1
cost = -(np.log(predY[Yis1]).sum() + np.log(1 - predY[~Yis1]).sum())/m
How do you usually handle this issue?
Add a small number (something like 1e-15) to predY - it doesn't shift the predictions much, and it solves the log(0) issue.
BTW, if your algorithm outputs zeros and ones, it might be useful to check the histogram of the returned probabilities - when an algorithm is that sure that something is happening, it can be a sign of overfitting.
One common way to deal with log(x) and y / x where x is always non-negative but can become 0 is to add a small constant (as written by Jakub).
You can also clip the value (e.g. tf.clip_by_value or np.clip).
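A minimal numpy sketch of the clipping variant (made-up Y and predY, with 1e-15 as an assumed epsilon):
import numpy as np
Y = np.array([1.0, 0.0, 1.0])
predY = np.array([1.0, 0.0, 0.9])       # the exact 0 and 1 would break np.log
m = len(Y)
eps = 1e-15
predY = np.clip(predY, eps, 1 - eps)    # keep probabilities away from exact 0 and 1
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
cost = -np.sum(loss) / m                # finite, no log(0) warnings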

What is the behavior of SAME padding when the stride is greater than 1?

My understanding of SAME padding in TensorFlow is that padding is added such that the output dimensions (for width and height) will be the same as the input dimensions. However, this understanding only really makes sense when stride = 1, because if the stride is greater than 1 then the output dimensions will almost certainly be smaller.
So I'm wondering what the algorithm is for calculating padding in this case. Is it simply that padding is added so that the filter is applied to every input value, rather than leaving some off on the right?
There is a formula for that:
n' = floor((n+2*p-f)/s + 1)
where n' is the output size, n is the input size, p is the padding, f is the filter size, and s is the stride.
If you are using SAME padding with stride > 1, p will be the minimum number that makes (n + 2*p - f) divisible by s. Note: p could be fractional, as it is averaged over the two sides of the image.
Peter's answer is correct but might lack a few details. Let me add on top of it.
Autopadding = SAME means that: o = ceil(i/s), where o = output size, i = input size, s = stride.
In addition, the generic output size formula is:
o = floor( (i + p - k) / s) + 1
where the new terms are p (the padding) and k, i.e., the effective kernel size (including dilation, or just the kernel size if dilation is disabled).
If you develop that formula to solve for p, you get:
p_min = (o - 1) * s - i + k  # i.e., when the floor is removed from the previous equation
p_max = o * s - i + k - 1    # i.e., when the numerator of the floor mod s is s - 1
Any padding value p in the range [p_min, p_max] will satisfy the condition o = ceil(i/s), meaning that for a stride s there are s solutions in total satisfying the formula.
It is the norm to use p_min as the padding, so you can ignore the other s - 1 solutions.
PS: This would be for 1D; for nD, simply repeat these formulas independently for each dimension, i.e.,
p_min[dimension_index] = (o[dimension_index] - 1) * s[dimension_index] - i[dimension_index] + k[dimension_index]
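To make the formulas concrete, here is a small sketch (made-up 1D sizes) that computes p_min and p_max and checks them against o = ceil(i/s); the two-sided split at the end mirrors how TensorFlow distributes SAME padding, with the extra pixel going to the right/bottom:
import math
i, k, s = 25, 3, 2                    # input size, effective kernel size, stride
o = math.ceil(i / s)                  # SAME target output size -> 13
p_min = (o - 1) * s - i + k           # minimum total padding -> 2
p_max = o * s - i + k - 1             # maximum total padding -> 3
assert (i + p_min - k) // s + 1 == o  # the generic output-size formula reproduces o
pad_before, pad_after = p_min // 2, p_min - p_min // 2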
For reference, these links are really useful:
https://arxiv.org/abs/1603.07285
https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
https://mmuratarat.github.io/2019-01-17/implementing-padding-schemes-of-tensorflow-in-python

How is this function programmatically building an LSTM

Here is the code:
def lstm(o, i, state):
    # these are all calculated separately, no overlap until....
    # (input * input weights) + (output * weights for previous output) + bias
    input_gate = tf.sigmoid(tf.matmul(i, w_ii) + tf.matmul(o, w_io) + b_i)
    # (input * output weights) + (output * weights for previous output) + bias
    output_gate = tf.sigmoid(tf.matmul(i, w_oi) + tf.matmul(o, w_oo) + b_o)
    # (input * forget weights) + (output * weights for previous output) + bias
    forget_gate = tf.sigmoid(tf.matmul(i, w_fi) + tf.matmul(o, w_fo) + b_f)
    # candidate value for updating the cell state
    memory_cell = tf.sigmoid(tf.matmul(i, w_ci) + tf.matmul(o, w_co) + b_c)
    # new cell state: keep part of the old state, add part of the candidate
    state = forget_gate * state + input_gate * memory_cell
    # the output is the gated, squashed cell state
    output = output_gate * tf.tanh(state)
    return output, state
And here is the drawing of the lstm:
I'm having trouble understanding how the two match up. Any help would be much appreciated.
This is an excellent blog post on LSTMs. The code here directly implements an LSTM and is equivalent to the equations listed on Wikipedia.
The input and output weights reflect the state of the network. In a simple fully-connected (FC) layer, we'd only have one weight matrix, which is what we would use to calculate the output of the layer.
The advantage of an LSTM, however, is that it includes multiple sources of information, or state; this is what we refer to when we say that an LSTM has memory. We have the output gate, just like the FC layer, but we also have the forget gate, the input gate, the cell state, and the hidden state. These all combine to provide multiple different sources of information. The equations show how they come together to produce the output.
In the equations, x_t is the input and f_t is the forget gate. I would recommend reading the linked blog post and the Wikipedia article to get an understanding of how the equations implement an LSTM.
The image depicts the input gate providing output to the cell based on the values from previous cells, and previous values of the input gate. The cell also incorporates the forget gate; the inputs are then fed into the output gate, which also takes previous values of the output gate as inputs.
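As a rough illustration, here is how the lstm() function above could be driven over a short sequence (TF 2.x eager; the sizes and random weights are made up, with one weight matrix for the input and one for the previous output per gate, as in the question):
import tensorflow as tf
num_inputs, num_units, batch, steps = 8, 16, 4, 5
def w(rows, cols):
    return tf.Variable(tf.random.normal([rows, cols], stddev=0.1))
def b():
    return tf.Variable(tf.zeros([num_units]))
w_ii, w_io, b_i = w(num_inputs, num_units), w(num_units, num_units), b()  # input gate
w_oi, w_oo, b_o = w(num_inputs, num_units), w(num_units, num_units), b()  # output gate
w_fi, w_fo, b_f = w(num_inputs, num_units), w(num_units, num_units), b()  # forget gate
w_ci, w_co, b_c = w(num_inputs, num_units), w(num_units, num_units), b()  # cell update
output = tf.zeros([batch, num_units])  # previous output o
state = tf.zeros([batch, num_units])   # cell state
for _ in range(steps):
    x_t = tf.random.normal([batch, num_inputs])  # one input per time step
    output, state = lstm(output, x_t, state)     # lstm() as defined in the question
print(output.shape, state.shape)                 # (4, 16) (4, 16)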

How to understand the bias parameter in LIBLINEAR?

I don't understand the meaning of the bias parameter in the LIBLINEAR API. Why is it specified by the user during training? Shouldn't it just be the distance from the separating hyperplane to the origin, i.e. a parameter of the learned model?
This is from the README:
struct problem
{
    int l, n;
    int *y;
    struct feature_node **x;
    double bias;
};
If bias >= 0, we assume that one additional feature is added to the end of each data instance.
What is this additional feature?
Let's look at the equation for the separating hyperplane:
w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + ... + w_bias * x_bias = 0
where x are the feature values and w are the trained "weights". The additional feature x_bias is a constant whose value is equal to the bias. If bias = 0, you get a separating hyperplane going through the origin (0,0,0,...). You can imagine many cases where such a hyperplane is not the optimal separator.
The value of the bias affects the margin through the scaling of w_bias. Therefore the bias is a tuning parameter, which is usually determined through cross-validation, like other parameters.
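As a rough numpy illustration (made-up feature values and weights) of what the appended feature does: every instance gets one constant extra feature whose value is the bias, and the model learns one extra weight w_bias for it:
import numpy as np
B = 1.0                                # the bias parameter passed to LIBLINEAR
X = np.array([[0.5, 2.0],
              [1.5, 0.3]])             # two instances, two real features
X_aug = np.hstack([X, np.full((X.shape[0], 1), B)])  # shape (2, 3): constant feature appended
w = np.array([0.7, -1.2, 0.4])         # learned weights; the last one is w_bias
decision = X_aug @ w                   # w_1*x_1 + w_2*x_2 + w_bias*B for each instance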