I use a pre-trained network from Tensorflow-Hub and pass the outcoming vector through 2 fully connected layers. I initialize the weight matrices with He-initialization and the biases with 0.
The loss function is behaving strangly. Also it does update the weights matrices somewhat, but mainly the biases.
Does anybody know, how to improve the learning?
Thanks in advance!
with tf.name_scope('tf_hub'):
module = hub.Module("https://tfhub.dev/google/imagenet/pnasnet_large/feature_vector/2")
tf_hub_features = module(X) # Features with shape [batch_size, num_features].
he_initializer = tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_IN', uniform=False)
with tf.name_scope('Hidden1'):
W1 = tf.get_variable(initializer=he_initializer, shape=[Constants.PNAS_NET2_NB_FEATURES, config["h1_nb_units"]],
name="W1")
# W1 = tf.Variable(tf.random_normal([Constants.PNAS_NET2_NB_FEATURES, config["h1_nb_units"]]), name="W1")
tf.summary.histogram("W1", W1)
b1 = tf.Variable(tf.zeros([config["h1_nb_units"]]), name="b1")
tf.summary.histogram("b1", b1)
o1 = tf.nn.relu(tf.matmul(tf_hub_features, W1) + b1, name="o1")
# dropout1 = tf.layers.dropout(inputs=o1, rate=config["keep_probability"], name="dropout1")
with tf.name_scope('Hidden2'):
W2 = tf.get_variable(initializer=he_initializer, shape=[config["h1_nb_units"], config["h2_nb_units"]],
name="W2")
tf.summary.histogram("W2", W2)
b2 = tf.Variable(tf.zeros([config["h2_nb_units"]]), name="b2")
tf.summary.histogram("b2", b2)
o2 = tf.nn.relu(tf.matmul(o1, W2) + b2, name="o2")
with tf.name_scope('Y'):
WY = tf.get_variable(initializer=he_initializer, shape=[config["h2_nb_units"], config["output_dim"]],
name="WY")
tf.summary.histogram("WY", WY)
bY = tf.Variable(tf.zeros([config["output_dim"]]), name="bY")
tf.summary.histogram("bY", bY)
Y_star = tf.add(tf.matmul(o2, WY), bY, name="Y_star")
Y = tf.nn.sigmoid(Y_star, name="Y")
with tf.name_scope('loss'):
Y_ = tf.placeholder(tf.float32, shape=(None, 1), name="Y_")
loss = tf.losses.log_loss(Y_, Y_hat)
optimizer = tf.train.AdamOptimizer(config["learning_rate"])
train_step = optimizer.minimize(loss)
The answer is quite simple. I had an error in feeding the input. They were all zeros and some ones. Therefore, there were only minor changes in the weights. I suppose the bias adjusted since it will learn something like the "mean" in linear regression.
Related
Code taken from:-http://adventuresinmachinelearning.com/python-tensorflow-tutorial/
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Python optimisation variables
learning_rate = 0.5
epochs = 10
batch_size = 100
# declare the training data placeholders
# input x - for 28 x 28 pixels = 784
x = tf.placeholder(tf.float32, [None, 784])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')
# calculate the output of the hidden layer
hidden_out = tf.add(tf.matmul(x, W1), b1)
hidden_out = tf.nn.relu(hidden_out)
# now calculate the hidden layer output - in this case, let's use a softmax activated
# output layer
y_ = tf.nn.softmax(tf.add(tf.matmul(hidden_out, W2), b2))
y_clipped = tf.clip_by_value(y_, 1e-10, 0.9999999)
cross_entropy = -tf.reduce_mean(tf.reduce_sum(y * tf.log(y_clipped)
+ (1 - y) * tf.log(1 - y_clipped), axis=1))
# add an optimiser
optimiser = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
# finally setup the initialisation operator
init_op = tf.global_variables_initializer()
# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# start the session
with tf.Session() as sess:
# initialise the variables
sess.run(init_op)
total_batch = int(len(mnist.train.labels) / batch_size)
for epoch in range(epochs):
avg_cost = 0
for i in range(total_batch):
batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
_, c = sess.run([optimiser, cross_entropy],
feed_dict={x: batch_x, y: batch_y})
avg_cost += c / total_batch
print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))
I wanted to ask, how does tensorflow recognize the parameters it needs to optimize , like in the above code we need to optimize w1,w2,b1 & b2 but we never specified that anywhere. We did ask GradientDescentOptimizer to minimize cross_entropy but we never told it that it would have to change the values of w1,w2,b1&b2 in order to do so , So how did it know the parameters on which cross_entropy depended upon?
The answer by Cory Nezin is only partially correct, and could lead to wrong assumptions!
You actually do specify which parameters are optimized (=trainable), namely by doing this:
# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([300]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([10]), name='b2')
In short, TensorFlow will only update tf.Variables. If you would use something like tf.Variable(...,trainable=False), you would not get any updates, regardless of what "network depends on". You would have still specified it, and the network would still propagate through that part, but you would never receive any updates for that specific variable.
Cory's answer is correct in the way that the network does automatically recognize what values to update it with, but you specify what has to be defined/updated first!
TensorFlow works on the premise of something called a computational graph. Essentially, whenever you say something like:
hidden_out = tf.add(tf.matmul(x, W1), b1)
TensorFlow says ok, so that output clearly depends on W1, I'll connect an edge from "hidden_out" to W1. This same process happens for y_, y_clipped, and cross_entropy. So in the end you have a graph which connects cross_entropy to W1. Pick your favorite graph traversal algorithm and TensorFlow finds the connection between cross entropy and W1.
I'm trying to build an a3c implementation in keras. I have experience working with keras, but absolutely no experience working with tensorflow. So I would really apreciate if someone could make it as simple as possible, since I want to finish it as fast as possible without diving too deep into tensorflow.
self.session = tf.Session()
K.set_session(self.session)
K.manual_variable_initialization(True)
self.stop_signal = False
self.model = self._build_model()
self.graph = self._build_graph(self.model)
self.session.run(tf.global_variables_initializer())
self.default_graph = tf.get_default_graph()
self.default_graph.finalize() # avoid modifications
def _build_model(self):
l_input = Input(batch_shape=(None, NUM_STATE))
input_layer = Reshape((1, -1))(l_input)
lstm = LSTM(64, activation='relu', return_sequences=True)(input_layer)
lstm = LSTM(128, activation='relu', return_sequences=True)(lstm)
lstm = LSTM(128, activation='relu')(lstm)
out_actions = Dense(NUM_ACTIONS, activation='softmax')(lstm)
out_value = Dense(1, activation='linear')(lstm)
model = Model(inputs=[l_input], outputs=[out_actions, out_value])
model._make_predict_function() # have to initialize before threading
return model
def _build_graph(self, model):
s_t = tf.placeholder(tf.float32, shape=(None, NUM_STATE))
a_t = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS))
r_t = tf.placeholder(tf.float32, shape=(None, 1))
p, v = model(s_t)
log_prob = tf.log(tf.reduce_sum(p * a_t, axis=1, keepdims=True) + 1e-10)
advantage = r_t - v
loss_policy = - log_prob * tf.stop_gradient(advantage)
loss_value = LOSS_V * tf.square(advantage)
entropy = LOSS_ENTROPY * tf.reduce_sum(p * tf.log(p + 1e-10), axis=1, keepdims=True)
loss_total = tf.reduce_mean(loss_policy + loss_value + entropy)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
minimize = optimizer.minimize(loss_total)
return s_t, a_t, r_t, minimize
Then it is beeing trained:
s_t, a_t, r_t, minimize = self.graph
self.session.run(minimize, feed_dict={s_t: s, a_t: a, r_t: r})
Predictions are done this way:
with self.default_graph.as_default():
p, v = self.model.predict(s)
So I want to update my keras model weights using these gradients after I finish training in order to save it using model.save('path.h5'). Peudo code:
model_weights = model.trainable_weights
model_weights = apply_gradients(grades, model_weights)
model = model.set_weights(model_weights)
model.save('path.h5')
The code was taken from here with little changes: https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py
I found something on this topic but can't really figure out how to actually use it.
https://github.com/keras-team/keras/issues/3062
https://github.com/keras-team/keras/issues/3069
Turns out the problem has to do with the algorithm not converging properly. If someone knows what can I do to make it converge? I'm using custom environment and I trained a DQN on this environment in the past and it successfully converged. I aslo implemented target model which I update every 300 steps (or 1 episode in my case).
I want to understand that how l2 regularization is implement here. In l2 regularization we add a square of weights to the loss function. But in this code we are also adding bias term. Why is it so?
`x = tf.placeholder(tf.float32, [None, nPixels])
W1 = tf.Variable(tf.random_normal([nPixels, nNodes1], stddev=0.01))
b1 = tf.Variable(tf.zeros([nNodes1])
y1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
W2 = tf.Variable(tf.random_normal([nNodes1, nLabels], stddev=0.01))
b2 = tf.Variable(tf.zeros([nLabels]))
y = tf.matmul(y1, W2) + b2
y_ = tf.placeholder(tf.float32, [None, nLabels])
l2_loss = tf.nn.l2_loss(W1) + tf.nn.l2_loss(b1) + tf.nn.l2_loss(W2) +
tf.nn.l2_loss(b2)
cross_entropy =
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y))
regularized_cross_entropy = cross_entropy + beta * l2_loss`
This bias is not the same we used in l2 regularization.
This is the bias we add in neural network to keep the value not equal to zero.
I have been trying for a while to implement sampled softmax because I have half a million output classes.
I have tried to follow the official documentation exactly, but I always get an error. This is my code:
def forward_propagation_sampled(X, parameters):
W1 = parameters['W1']
b1 = parameters['b1']
W2 = parameters['W2']
b2 = parameters['b2']
W3 = parameters['W3']
b3 = parameters['b3']
Z1 = tf.add(tf.matmul(W1, X), b1)
A1 = tf.nn.relu(Z1)
Z2 = tf.add(tf.matmul(W2,A1), b2)
A2 = tf.nn.relu(Z2)
Z3 = tf.add(tf.matmul(W3,A2), b3)
return Z3, W3, b3
This is the cost computation function:
def compute_cost(Z3, W3, b3, Y, mode):
Z3.set_shape([1144,1])
if mode == "train":
loss = tf.nn.sampled_softmax_loss(
weights=tf.transpose(W3),
biases=tf.Variable(b3),
labels = tf.reshape(tf.argmax(Y, 1), [-1,1]), #Since Y is one hot encoded
inputs=tf.Variable(initial_value=Z3,dtype=tf.float32, expected_shape=[1144,1]),
num_sampled = 2000,
num_classes = 1144,
partition_strategy="div"
)
elif mode == "eval":
logits = tf.matmul(inputs, tf.transpose(weights))
logits = tf.nn.bias_add(logits, biases)
labels_one_hot = tf.one_hot(labels, n_classes)
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels_one_hot,logits=logits)
cost = tf.reduce_mean(loss)
return cost
For the purpose of just testing this out, I am using 1144 output classes, which would otherwise scale to 500,000. There are 3144 training examples.
I get this error:
Shape must be rank 1 but is rank 2 for 'sampled_softmax_loss/Slice_1' (op: 'Slice') with input shapes: [3144,1], [1], [1].
I am unable to debug this or make any sense out of it. Any help would be really appreciated.
I am new to tensorflow and have tried to implement a simple one-layer linear network similar to https://www.tensorflow.org/get_started/mnist/beginners
x = tf.placeholder(tf.float32, [None, IN_SIZE], name="input")
W1 = tf.Variable(tf.zeros([IN_SIZE, OUT_SIZE]), name="Weight1")
b1 = tf.Variable(tf.zeros([OUT_SIZE]), name="bias1")
y = tf.matmul(x, W1) + b1
y_ = tf.placeholder(tf.float32, [None, OUT_SIZE], name="target")
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.AdamOptimizer(1e-3).minimize(cross_entropy)
The program works as expected and I have no problem on that. However, I try to add another layer but only found the W1,b1,W2 learnt are all zero matrix, and only the bias b2 contains nonzero values. Below is my modified network
x = tf.placeholder(tf.float32, [None, IN_SIZE], name="input")
W1 = tf.Variable(tf.zeros([IN_SIZE, L1_SIZE]), name="Weight1")
b1 = tf.Variable(tf.zeros([L1_SIZE]), name="bias1")
y = tf.matmul(x, W1) + b1
W2 = tf.Variable(tf.zeros([L1_SIZE, OUT_SIZE]), name="Weight2")
b2 = tf.Variable(tf.zeros([OUT_SIZE]), name="bias2")
y = tf.nn.relu(y)
y = tf.matmul(y, W2) + b2
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, OUT_SIZE], name="target")
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.AdamOptimizer(1e-3).minimize(cross_entropy)
The problem is that if you initialize the weight matrices before a relu with zeroes the gradients will always be zero and no learning will happen. You need to do random initialization.