I've learned ML and have been learning DL from Andrew Ng's Coursera courses, and every time he talks about a linear classifier, the weights are just a 1-D vector.
Even during the assignments, when we unroll an image into a 1-D vector (pixels * 3), the weights are still a 1-D vector.
I have now started O'Reilly's "Learning TensorFlow" book and came across the first example. The weight initialization in TensorFlow was a bit different.
The book says the following (page 14):
"Since we are not going to use the spatial information at this point, we will unroll our image pixels as a single long vector denoted x (Figure 2-2). Then
$xw^0 = \sum_i x_i w^0_i$
will be the evidence for the image containing the digit 0 (and in the same way we will have $w^d$ weight vectors for each one of the other digits, d = 1, . . . , 9)."
and the corresponding TensorFlow code:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

data = input_data.read_data_sets(DATA_DIR, one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])        # batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))               # one 784-long weight column per digit
y_true = tf.placeholder(tf.float32, [None, 10])    # one-hot labels
y_pred = tf.matmul(x, W)                           # evidence for all 10 digits at once
Why are the weights 2-D here? Are the weights 2-D in a softmax linear classifier?
In the Coursera course, when he taught the softmax linear classifier, he still said the weights are 1-D. Can anyone explain this?
Yes, you are right that the weights are 1-D, but that is just for a single neuron.
If you consider a straightforward layered neural network, it will have some number of layers (just one layer with 10 neurons in your code). In TensorFlow, the weights variable contains the weights for the entire layer, not for a single neuron, which makes it a 2-D array.
W = tf.Variable(tf.zeros([784, 10]))
This line means that there are 10 neurons, each with a weight array of length 784.
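To make that concrete, here is a minimal, self-contained sketch (the variable names are only illustrative) showing that column d of the 2-D W is exactly the 1-D weight vector $w^d$ from the book's formula:
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))

w_0 = W[:, 0]                                # 1-D weight vector for digit 0, shape (784,)
evidence_0 = tf.reduce_sum(x * w_0, axis=1)  # x . w^0 for each image in the batch
all_evidence = tf.matmul(x, W)               # the same dot products for all 10 digits at once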
One rule of thumb to understand this in TensorFlow is that the weight dimensions are written as:
W = tf.Variable(tf.zeros([output_of_previous_layer, output_of_current_layer]))
or
W = tf.Variable(tf.zeros([input_of_current_layer, input_of_next_layer]))
You can read more about this at Intro to Neural Networks
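For example, here is a hedged sketch of the same rule of thumb applied to a two-layer network (the layer sizes are made up for illustration):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.zeros([784, 256]))   # 784 inputs -> 256 hidden units
W2 = tf.Variable(tf.zeros([256, 10]))    # 256 hidden units -> 10 output neurons
hidden = tf.nn.relu(tf.matmul(x, W1))
logits = tf.matmul(hidden, W2)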
I'm trying out something from a paper[1], where it is required to have a common bias term for all the neurons in the FC layer, instead of individual biases for each neuron.
How do I do it in TensorFlow?
Is it possible to create an FC layer without biases and then add the shared bias with tf.nn.bias_add()? Is that the right way to do it?
If so, there is no flag to set tf.nn.bias_add() as trainable. Will it work?
Or any other suggestions?
References:
[1]: Deep Reinforcement Learning with Double Q-Learning (p. 6, right column)
tf.nn.bias_add(value, bias) requires bias to be a 1-D tensor with size matching the last dimension of value, so a plain tf.add (which broadcasts) will do:
X = tf.placeholder(tf.float32, [None, 32])
W = tf.Variable(tf.truncated_normal([32, 16]))
# A single bias shared by all 16 outputs
bias = tf.Variable(tf.truncated_normal([1]))
# Works as expected: the scalar bias is broadcast across the 16 outputs
y = tf.add(tf.matmul(X, W), bias)
# Raises "ValueError: Dimensions must be equal, but are 16 and 1 for 'BiasAdd'"
z = tf.nn.bias_add(tf.matmul(X, W), bias)
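As for trainability: bias above is a tf.Variable, which is trainable by default, so no extra flag is needed. A quick sanity check built on top of the snippet (the loss here is only a placeholder example):
loss = tf.reduce_mean(tf.square(y))
grads = tf.gradients(loss, [W, bias])   # both W and the shared bias receive gradients
print(tf.trainable_variables())         # the shared bias is listed alongside W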
I'm studying LSTMs combined with CNNs in TensorFlow.
I want to feed some scalar label into an LSTM network as a condition.
Does anybody know which kind of LSTM does what I mean?
If it is available, please let me know how to use it.
Thank you.
This thread might interest you: Adding Features To Time Series Model LSTM.
You have basically 3 possible ways:
Let's take an example with weather data from two different cities: Paris and San Francisco. You want to predict the next temperature based on historical data. But at the same time, you expect the weather to change based on the city. You can either:
Combine the auxiliary features with the time series data, at the beginning or at the end (ugly!).
Concatenate the auxiliary features with the output of the RNN layer. It's some kind of post-RNN adjustment since the RNN layer won't see this auxiliary info.
Or just initialize the RNN states with a learned representation of the condition (e.g. Paris or San Francisco); see the sketch after this list.
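Here is a minimal TF 1.x sketch of that third option for the weather example above (all shapes and names are illustrative, and this is not the cond_rnn API):
import tensorflow as tf

NUM_UNITS, NUM_CITIES, SEQ_LEN = 64, 2, 30

series = tf.placeholder(tf.float32, [None, SEQ_LEN, 1])   # temperature history
city_id = tf.placeholder(tf.int32, [None])                # 0 = Paris, 1 = San Francisco

# Learn a representation of the condition and map it to the initial (c, h) state.
city_emb = tf.get_variable('city_emb', [NUM_CITIES, NUM_UNITS])
cond = tf.nn.embedding_lookup(city_emb, city_id)
init_state = tf.nn.rnn_cell.LSTMStateTuple(
    tf.layers.dense(cond, NUM_UNITS),    # initial cell state c
    tf.layers.dense(cond, NUM_UNITS))    # initial hidden state h

cell = tf.nn.rnn_cell.LSTMCell(NUM_UNITS)
outputs, state = tf.nn.dynamic_rnn(cell, series, initial_state=init_state)
prediction = tf.layers.dense(state.h, 1)                  # next-temperature estimate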
I wrote a library to condition RNNs on auxiliary inputs. It abstracts away all the complexity and has been designed to be as user-friendly as possible:
https://github.com/philipperemy/cond_rnn/
The implementation is in TensorFlow (>=1.13.1) and Keras.
Hope it helps!
Here's an example of applying a CNN and an LSTM over the output probabilities of a sequence, as you asked:
def build_model(inputs):
    BATCH_SIZE = 4
    NUM_CLASSES = 2
    NUM_UNITS = 128
    H = 224
    W = 224
    C = 3
    TIME_STEPS = 4
    # inputs is assumed to be of shape (BATCH_SIZE, TIME_STEPS, H, W, C)
    # reshape your input such that you can apply the CNN to all images at once
    input_cnn_reshaped = tf.reshape(inputs, (-1, H, W, C))
    # define a CNN, for instance VGG-16 (e.g. tf.contrib.slim.nets.vgg.vgg_16)
    cnn_logits_output, _ = vgg_16(input_cnn_reshaped, num_classes=NUM_CLASSES)
    cnn_probabilities_output = tf.nn.softmax(cnn_logits_output)
    # reshape back to the time-series convention
    cnn_probabilities_output = tf.reshape(cnn_probabilities_output, (BATCH_SIZE, TIME_STEPS, NUM_CLASSES))
    # run an LSTM over the per-image probabilities
    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, cnn_probabilities_output, dtype=tf.float32)
    # employ an FC layer over the last hidden state
    logits = tf.layers.dense(state.h, NUM_CLASSES)
    # logits is of shape (BATCH_SIZE, NUM_CLASSES)
    return logits
By the way, a better approach would be to employ the LSTM over the last hidden layer of the CNN, i.e. to use the CNN as a feature extractor and make the prediction over sequences of features; see the sketch below.
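Here is a hedged sketch of that variant, reusing the constants from the function above (assumed to be in scope); cnn_features is a hypothetical stand-in for the CNN's penultimate layer and FEATURE_DIM is an assumed feature size:
def build_feature_model(inputs):
    FEATURE_DIM = 4096
    # flatten the time dimension and extract one feature vector per frame
    frames = tf.reshape(inputs, (-1, H, W, C))
    features = cnn_features(frames)                 # hypothetical extractor, shape (BATCH*TIME, FEATURE_DIM)
    features = tf.reshape(features, (BATCH_SIZE, TIME_STEPS, FEATURE_DIM))
    # run the LSTM over the sequence of feature vectors instead of class probabilities
    cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)
    _, state = tf.nn.dynamic_rnn(cell, features, dtype=tf.float32)
    return tf.layers.dense(state.h, NUM_CLASSES)    # class logits per sequence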
I recently read this paper which deals with noisy labels in convolutional neural networks.
They model label noise by a probability transition matrix, which forms a simple constrained linear layer after the softmax output.
So, as an example, we may have a 3-by-3 probability transition matrix (3 classes):
[Figure: example probability transition matrix; the sum of each column has to be 1.]
This matrix Q is basically trained in the same way as the rest of the network via backpropagation. But it needs to be constrained to be a probability matrix. Quote from the paper:
After taking a gradient step with the Q and the model
weights, we project Q back to the subspace of probability matrices because it represents conditional probabilities.
Now I am wondering what the best way is to implement such a layer in TensorFlow.
I have some ideas, but I'm not sure what would work or what the best procedure is.
1) Hard-code the constraint into the model before any training is done, something like:
# ... build conv model without Q
[...]
# shape of y_conv (the CNN output) is assumed to be a [3, 1] vector
y_conv = tf.nn.softmax(y_conv, 0)
# add linear layer representing Q, no bias
W_Q = weight_variable([3, 3])
# add constraint: columns are valid probability distribution
W_Q = tf.nn.softmax(W_Q, 0)
# output of model:
Q_out = tf.matmul(W_Q, y_conv)
# now compute loss, gradients and start training
2) Compute and apply gradients to the whole model (Q included), then apply the constraint:
train_op = ...
constraint_op = tf.assign(W_Q, tf.nn.softmax(W_Q, 0))
sess = tf.Session()
# compute and apply gradients in the form of a train_op
sess.run(train_op)
sess.run(constraint_op)
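To make this second idea concrete, here is a minimal sketch of how the training loop could look (train_op, the placeholders x and y_, and the batch iterator are hypothetical and assumed to be defined elsewhere):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_y in batches:                           # hypothetical batch iterator
        sess.run(train_op, feed_dict={x: batch_x, y_: batch_y})
        sess.run(constraint_op)                                # project W_Q back to a probability matrix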
I think the second approach is closer to the paper quote, but I am not sure to what extent such external assignments interfere with training.
Or maybe my ideas are bananas. I hope you can give me some advice!
I am trying to build the CLDNN that is researched in the paper here.
After the convolutional layers, the features go through a dimensionality-reduction layer. At the point where the features leave the conv layers, their dimensions are [?, N, M]. N represents the number of windows, and I think the network requires a reduction of the dimension M, so the dimensions of the features after the dim-reduction layer are [?, N, Q], where Q < M.
I have two questions.
How do I do this in TensorFlow? I tried using a weight with
W = tf.Variable(tf.truncated_normal([M, Q], stddev=0.1))
I thought the multiplication tf.matmul(x, W) would yield [?, N, Q], but [?, N, M] and [M, Q] are not valid dimensions for multiplication. I would like to keep N constant and reduce the dimension M.
What kind of non-linearity should I apply to the outcome of tf.matmul(x, W)? I was thinking about using a ReLU, but I couldn't even get #1 done.
According to the linked paper (T. N. Sainath et al.: "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks"),
[...] reducing the dimensionality, such that we have 256 outputs from the linear layer, was appropriate.
That means, whatever the input size is, i.e. [?, N, M] or any other dimensionality (always assuming that the first dimension is the number of samples in a mini-batch, denoted by ?), the output will be [?, Q], where typically Q=256.
Since we are doing dimensionality reduction by multiplying the input with a weight matrix, no spatial information will be preserved. This means it does not matter whether each input is a matrix or a vector, so we can reshape the input x of the linear layer to have dimensions [?, N*M]. Then a simple matrix multiplication tf.matmul(x, W) does the job, where W is a matrix with dimensions [N*M, Q].
W = tf.Variable(tf.truncated_normal([N*M, Q], stddev=0.1))
x_vec = tf.reshape(x, shape=(-1, N*M))   # flatten each [N, M] input to a single vector
y = tf.matmul(x_vec, W)                  # output of the linear layer, shape [?, Q]
Finally, regarding question 2: in the paper, the dimensionality reduction layer is a linear layer, i.e. you do not apply a non-linearity to the output.
I have been trying to learn how to code up an RNN and an LSTM in TensorFlow. I found an example online in this blog post:
http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html
Below are the snippets which I am having trouble understanding, for an LSTM network to be used eventually for char-rnn generation.
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])
rnn_inputs = [tf.squeeze(i) for i in tf.split(1,
num_steps, tf.nn.embedding_lookup(embeddings, x))]
Now, a different section of the code, where the weights are defined:
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [state_size, num_classes])
    b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
logits = [tf.matmul(rnn_output, W) + b for rnn_output in rnn_outputs]
y_as_list = [tf.squeeze(i, squeeze_dims=[1]) for i in tf.split(1, num_steps, y)]
x is the data to be fed, and y is the set of labels. In the LSTM equations we have a series of gates: x(t) gets multiplied by one set of weights, prev_hidden_state gets multiplied by another set of weights, biases are added, and non-linearities are applied.
Here are the doubts I have:
1. In this case only one weight matrix is defined. Does that mean it works for both x(t) and prev_hidden_state as well?
2. For the embedding matrix, I know it has to be multiplied by the weight matrix, but why is its first dimension num_classes?
3. For the rnn_inputs we are using squeeze, which removes dimensions of 1, but why would I want to do that with a one-hot encoding?
4. Also, from the splits I understand that we are unrolling the x of dimension (batch_size x num_steps) into discrete (batch_size x 1) vectors and then passing these values through the network. Is this right?
Let me try to help.
In this case only one weight matrix is defined. Does that mean it works for both x(t) and prev_hidden_state as well?
There are more weights, created when you call tf.nn.rnn_cell.LSTMCell. They are the internal weights of the RNN cell, which TensorFlow creates implicitly when you call the cell.
The weight matrix you explicitly defined is the transform from the hidden state to the vocabulary space.
You can view the implicit weights as accounting for the recurrent part: they take the previous hidden state and the current input and produce the new hidden state. The weight matrix you defined transforms the hidden state (e.g. state_size = 200) to the larger vocabulary space (e.g. vocab_size = 2000).
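A quick TF 1.x sketch to see those implicit weights next to the explicit softmax W and b (the exact variable names and the use of static_rnn instead of the older tf.nn.rnn depend on the TensorFlow version):
cell = tf.nn.rnn_cell.BasicLSTMCell(state_size)
rnn_outputs, final_state = tf.nn.static_rnn(cell, rnn_inputs, dtype=tf.float32)
for v in tf.trainable_variables():
    print(v.name, v.shape)
# prints the cell's kernel of shape (state_size + state_size, 4 * state_size) and its bias,
# alongside the embedding matrix and the softmax W and b defined above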
For further information, maybe you can view this tutorial : http://colah.github.io/posts/2015-08-Understanding-LSTMs/
For the embedding matrix, I know it has to be multiplied by the weight matrix, but why is its first dimension num_classes?
num_classes here is the vocab_size: the embedding matrix transforms the vocabulary into the required embedding size (which in this example is equal to state_size).
For the rnn_inputs we are using squeeze, which removes dimensions of 1, but why would I want to do that with a one-hot encoding?
You need to get rid of the extra dimension because tf.nn.rnn takes a list of inputs, each of shape (batch_size, input_size), instead of (batch_size, 1, input_size).
Also, from the splits I understand that we are unrolling the x of dimension (batch_size x num_steps) into discrete (batch_size x 1) vectors and then passing these values through the network. Is this right?
To be more precise: after the embedding lookup, the (batch_size, num_steps, state_size) tensor turns into a list of num_steps elements, each of size (batch_size, 1, state_size).
The flow goes like this (a shape-check sketch follows the list):
1. The embedding matrix embeds each word as a state_size-dimensional vector (a row of the matrix), so its size is (vocab_size, state_size).
2. tf.nn.embedding_lookup retrieves the rows at the indices given by the x placeholder, producing the RNN input of size (batch_size, num_steps, state_size).
3. tf.split splits that input into num_steps tensors of shape (batch_size, 1, state_size).
4. tf.squeeze squeezes them to (batch_size, state_size), forming the desired input format for tf.nn.rnn.
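Here is a small sketch of those shapes with hypothetical sizes (note that it uses the TF >= 1.0 argument order for tf.split and tf.squeeze, which differs from the older API in the blog post):
import tensorflow as tf

batch_size, num_steps, vocab_size, state_size = 32, 5, 2000, 200

x = tf.placeholder(tf.int32, [batch_size, num_steps])
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size])

embedded = tf.nn.embedding_lookup(embeddings, x)          # (32, 5, 200)
pieces = tf.split(embedded, num_steps, axis=1)            # list of 5 tensors, each (32, 1, 200)
rnn_inputs = [tf.squeeze(p, axis=[1]) for p in pieces]    # list of 5 tensors, each (32, 200)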
If anything about these TensorFlow methods is still unclear, you can look them up in the TensorFlow API documentation for a more detailed introduction.