policy gradient with binary action space - tensorflow

I am training an agent using the policy gradient method. After training, the agent always ends up choosing the same one of its two actions.
Below is my code
action = tf.where(self.model(state)[:, -1] > 0.5, 1., 0.)
reward = self.get_rewards(action, state)
with tf.GradientTape() as tape:
    tape.watch(self.model.trainable_weights)
    prob = self.model(state, True)
    dist = tfp.distributions.Categorical(probs=prob)
    log_prob = dist.log_prob(action)
    loss = -tf.math.reduce_mean(reward * log_prob)
grads = tape.gradient(loss, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
Here self.get_rewards(action, state) returns a positive return (properly calculated), and self.model(state) returns the probabilities [p, 1-p].
My guess is that the optimal choice is p = 0 or p = 1, since that makes the loss equal to 0, which is always the minimum. It is the minimum because reward is always positive and log_prob is always non-positive, so -reward * log_prob is always non-negative.
Is there any way to fix this problem? I tried to use an off-policy gradient, but it did not help much, and I am not sure why.
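One thing that may help (this is my own sketch, not code from the post, and the tf.cast assumes get_rewards expects the float actions used above): sample the action from the policy distribution instead of thresholding it, and subtract a baseline so the advantage can also be negative; otherwise every update pushes the chosen action's probability towards 1.
dist = tfp.distributions.Categorical(probs=self.model(state))
action = dist.sample()                                # stochastic action instead of a hard threshold
reward = self.get_rewards(tf.cast(action, tf.float32), state)

with tf.GradientTape() as tape:
    prob = self.model(state, training=True)
    dist = tfp.distributions.Categorical(probs=prob)
    log_prob = dist.log_prob(action)
    advantage = reward - tf.reduce_mean(reward)       # simple baseline, assuming a batch of rewards
    loss = -tf.reduce_mean(advantage * log_prob)

grads = tape.gradient(loss, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))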

Related

Why tf.keras.Model training flag significantly alters the prediction result?

I was recently going through the TensorFlow pix2pix tutorial, and after playing with it a bit I unexpectedly realized that there is a major difference between the predictions of a tf.keras.Model (in this case the Generator() from the tutorial) depending on whether the call uses training=True or training=False.
Here is the code to demonstrate the issue:
# ...Tutorial steps where I load the model instead of creating a new one...
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
for example_input, example_target in test_dataset.take(1):
    train_res1 = generate_images(generator, example_input, example_target)    # Function definition as per tutorial, except that I return the 'Predicted Image'
    train_res2 = generate_images(generator, example_input, example_target)    # Since training is True (it alters the model), a small RGB difference is expected
    notrain_res2 = generate_images2(generator, example_input, example_target) # Identical to 'generate_images' except that training=False; should be identical or similar to the previous one
    r_avg = np.average(train_res1[:, :, 0])
    g_avg = np.average(train_res1[:, :, 1])
    b_avg = np.average(train_res1[:, :, 2])
    print(f"Training flag true iteration#1 = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
    r_avg = np.average(train_res2[:, :, 0])
    g_avg = np.average(train_res2[:, :, 1])
    b_avg = np.average(train_res2[:, :, 2])
    print(f"Training flag true iteration#2 = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
    r_avg = np.average(notrain_res2[:, :, 0])
    g_avg = np.average(notrain_res2[:, :, 1])
    b_avg = np.average(notrain_res2[:, :, 2])
    print(f"Training flag false = R average: {r_avg}, G average: {g_avg}, B average: {b_avg}")
To avoid any confusion, here is the code of generate_images2, which is identical to generate_images from the tutorial except that training=False and I return the prediction:
def generate_images2(model, test_input, tar):
    prediction = model(test_input, training=False)
    plt.figure(figsize=(15, 15))
    display_list = [test_input[0], tar[0], prediction[0]]
    title = ['Input Image', 'Ground Truth', 'Predicted Image']
    for i in range(3):
        plt.subplot(1, 3, i+1)
        plt.title(title[i], color="w")
        # Getting the pixel values in the [0, 1] range to plot.
        plt.imshow(display_list[i] * 0.5 + 0.5)
        plt.axis('off')
    plt.show()
    return display_list[2]
Here is where you can visualize my concerns with the training flag.
As expected, there are minor differences between the RGB values of iteration #1 and iteration #2 with training=True; this is expected because running in training mode alters the model.
However, with training=False I would expect the RGB values to be similar or identical to iteration #2 with training=True, yet if you look at the door in the areas circled in yellow and red, the RGB values are clearly different.
The result is pretty much always better with training=True.
Question: why does the tf.keras.Model training flag significantly alter the prediction result?
Here's an answer from the Tensorflow repo:
https://github.com/tensorflow/tensorflow/issues/36936
Some things only happen during training; for example, dropout is applied. If training=False, dropout layers are ignored (see https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).
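To make that concrete, here is a small self-contained illustration (not from the tutorial) of how the training flag switches a Dropout layer on and off:
import tensorflow as tf

x = tf.ones([1, 8])
dropout = tf.keras.layers.Dropout(rate=0.5)

print(dropout(x, training=True))   # roughly half the entries zeroed, the rest scaled by 1/(1-rate)
print(dropout(x, training=False))  # identity: all ones
As far as I can tell, the tutorial's generator contains dropout and batch normalization layers, both of which behave differently depending on this flag, which is why the two predictions differ.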

GradientTape for variable weighted sum of two Sequential models in TensorFlow

Suppose we want to minimize the following objective using gradient descent:
min f(alpha * v + (1 - alpha) * w), where v and w are the weights of two models and alpha, between 0 and 1, is the mixing weight for the sum that produces the combined model v_bar or ū (here referred to as m).
alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)
m_weights_trainable = tf.nest.map_structure(lambda v, w: alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable, m_weights_trainable)
In the Adaptive Personalized Federated Learning paper, the formula for the alpha update step suggests updating alpha based on the gradients of model m applied to a minibatch. I tried it with and without the watch, but it always leads to "No gradients provided for any variable".
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch([alpha])
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Some more information about the initialization of the models:
The m.forward_pass(batch) is the default implementation from tff.learning.Model (found here); the models are created with tff.learning.from_keras_model from a tf.keras.Sequential model.
def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=element_spec,
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.MeanSquaredError(),
                 tf.keras.metrics.MeanAbsoluteError()],
    )

w = model_fn()
v = model_fn()
m = model_fn()
Some more experimenting, as suggested below by Zachary Garrett:
It seems that whenever this weighted sum is calculated and the new weights are assigned to the model, the tape loses track of the trainable variables of the two summed models. Again, it leads to "No gradients provided for any variable" whenever optimizer.apply_gradients(zip([grad], [alpha])) is called; all gradients are None.
with tf.GradientTape() as tape:
    alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
    m_weights_trainable = tf.nest.map_structure(
        lambda w, v: tf.math.scalar_mul(alpha, v) + tf.math.scalar_mul(tf.constant(1.0) - alpha, w),
        w.trainable,
        v.trainable)
    m_weights = tff.learning.ModelWeights.from_model(m)
    tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable,
                          m_weights_trainable)
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Another edit:
I think I have a strategy that gets it working, but it feels like bad practice, since manually setting trainable_weights or _trainable_weights does not work. Any tips on improving this?
def do_weighted_combination():
    def _mapper(target_layer, v_layer, w_layer):
        target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1 - alpha)
        target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1 - alpha)
    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)

with tf.GradientTape(persistent=True) as tape:
    do_weighted_combination()
    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)

g1 = tape.gradient(loss, v.trainable_weights)  # Not None
g2 = tape.gradient(loss, alpha)                # Not None
For TensorFlow auto-differentiation using tf.GradientTape, operations must occur within the tf.GradientTape Python context manager so that TensorFlow can "see" them.
Possibly what is happening here is that alpha is used outside/before the tape context, when setting the model variables. Then, when m.forward_pass is called, TensorFlow doesn't see any access to alpha and thus can't compute a gradient for it (instead returning None).
Moving the
alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable
logic inside the tf.GradientTape context manager (possibly inside m.forward_pass) may be a solution.
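To illustrate that suggestion, here is a minimal plain-TF sketch (not TFF, reduced to a single kernel, with names of my own choosing): the alpha-weighted combination is built inside the tape and flows into the loss as tensors rather than being assigned into variables, so tape.gradient(loss, alpha) is not None.
import tensorflow as tf

alpha = tf.Variable(0.01, constraint=lambda t: tf.clip_by_value(t, 0.0, 1.0))

# Stand-ins for the trainable weights of models v and w (a single dense kernel here).
v_kernel = tf.Variable(tf.random.normal([4, 1]))
w_kernel = tf.Variable(tf.random.normal([4, 1]))

x = tf.random.normal([8, 4])
y = tf.random.normal([8, 1])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    # The combination happens inside the tape, so alpha is on the recorded path;
    # using .assign() here would break the gradient with respect to alpha.
    mixed_kernel = alpha * v_kernel + (1.0 - alpha) * w_kernel
    pred = tf.matmul(x, mixed_kernel)
    loss = tf.reduce_mean(tf.square(y - pred))

grad = tape.gradient(loss, alpha)          # not None
optimizer.apply_gradients([(grad, alpha)])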

How to use an optimizer within a forward pass in PyTorch

I want to use an optimizer within the forward pass of a custom defined Function, but it doesn't work. My code is as follows:
class MyFct(Function):
    @staticmethod
    def forward(ctx, *args):
        input, weight, bias = args[0], args[1], args[2]
        y = torch.tensor([[0]], dtype=torch.float, requires_grad=True)  # initial guess
        loss_fn = lambda y_star: (input + weight - y_star)**2
        learning_rate = 1e-4
        optimizer = torch.optim.Adam([y], lr=learning_rate)
        for t in range(5000):
            y_star = y
            print(y_star)
            loss = loss_fn(y_star)
            if t % 100 == 99:
                print(t, loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return y_star
And these are my test inputs:
x = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
w = torch.tensor([[2]], dtype=torch.float, requires_grad=True)
y = torch.tensor([[6]], dtype=torch.float)
fct = MyFct.apply
y_hat = fct(x, w, None)
I always get the RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
Also, I've tested the optimization outside of the forward and it works, so I guess it's something with the context? According to the documentation "Tensor arguments that track history (i.e., with requires_grad=True) will be converted to ones that don’t track history before the call, and their use will be registered in the graph", see https://pytorch.org/docs/stable/notes/extending.html. Is this the problem? Is there a way to work around it?
I am new to PyTorch and I wonder what I'm overlooking. Any help and explanation is appreciated.
I think I found an answer here: https://github.com/pytorch/pytorch/issues/8847, i.e. I need to wrap the inner optimization in with torch.enable_grad():.
However, I still don't understand why it's necessary to convert the original Tensors to ones that don’t track history in forward().
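For reference, here is how I read that workaround, as a minimal sketch; the 500-step loop, the learning rate, and the backward() placeholder are my own assumptions, not a verified implementation:
import torch
from torch.autograd import Function

class MyFct(Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        # forward() runs with autograd disabled and detached inputs, so the
        # inner optimization has to be wrapped in torch.enable_grad().
        y = torch.zeros(1, 1, requires_grad=True)   # initial guess
        optimizer = torch.optim.Adam([y], lr=1e-2)
        with torch.enable_grad():
            for _ in range(500):
                loss = ((input + weight - y) ** 2).sum()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return y.detach()

    @staticmethod
    def backward(ctx, grad_output):
        # Placeholder: for this toy problem y* = input + weight, so the gradient
        # passes straight through to input and weight; bias gets None.
        return grad_output, grad_output, None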

consistent forward / backward pass with tensorflow dropout

For reinforcement learning, one usually runs a forward pass of the neural network at each step of the episode in order to compute the policy. Afterwards, one can calculate the parameter gradients using backpropagation. A simplified implementation of my network looks like this:
class AC_Network(object):
    def __init__(self, s_size, a_size, scope, trainer, parameters_net):
        with tf.variable_scope(scope):
            self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            # (...)
            layer = slim.fully_connected(self.inputs,
                                         layer_size,
                                         activation_fn=tf.nn.relu,
                                         biases_initializer=None)
            layer = tf.contrib.layers.dropout(inputs=layer, keep_prob=parameters_net["dropout_keep_prob"],
                                              is_training=self.is_training)
            self.policy = slim.fully_connected(layer, a_size,
                                               activation_fn=tf.nn.softmax,
                                               biases_initializer=None)
            self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
            self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
            actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
            responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
            self.policy_loss = - policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)
            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.policy_loss, local_vars)
Now, during training, I first roll out the episode with consecutive forward passes (again, a simplified version):
s = self.local_env.reset()  # list of input variables for the first step
while done == False:
    a_dist = sess.run([self.policy],
                      feed_dict={self.local_AC.inputs: [s],
                                 self.is_training: True})
    a = np.argmax(a_dist)
    s, r, done, extra_stat = self.local_env.step(a)
    # (...)
# (...)
and at the end I calculate the gradients with a backward pass:
p_l, grad = sess.run([self.policy_loss,
                      self.gradients],
                     feed_dict={self.inputs: np.vstack(comb_observations),
                                self.is_training: True,
                                self.actions: np.hstack(comb_actions)})
(please note that I could have made a mistake somewhere above trying to remove as much as possible of the original code irrelevant to the issue in question)
So finally the question: is there a way to ensure that all consecutive calls to sess.run() generate the same dropout structure? Ideally I would like exactly the same dropout structure within each episode, changing it only between episodes. Things seem to work well as they are, but I keep wondering.
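One possible approach (my own sketch, not from the thread; the names dropout_mask and episode_mask are hypothetical): replace the built-in dropout with a mask placeholder, draw the mask once per episode with NumPy, and feed the same mask into every sess.run() of that episode. Dividing by keep_prob keeps the expected activation scale the same as with inverted dropout.
# In __init__, instead of tf.contrib.layers.dropout:
self.dropout_mask = tf.placeholder(shape=[layer_size], dtype=tf.float32)
layer = layer * self.dropout_mask / parameters_net["dropout_keep_prob"]

# Once per episode, before the rollout:
keep_prob = parameters_net["dropout_keep_prob"]
episode_mask = (np.random.rand(layer_size) < keep_prob).astype(np.float32)

# Then pass the same mask to every forward pass and to the final backward pass:
# feed_dict={..., self.dropout_mask: episode_mask}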

tensorflow weights only 2 values change?

I wrote a simple NN with TensorFlow to actuate a real robotic finger.
My problem is that after training for an hour it seems to have learned a little about which direction to go, but when I look at the weights in TensorBoard it looks like only two values get updated; all the other values stay around zero (to which they were initialized).
Here is my code:
https://github.com/flobotics/flobotics_tensorflow_controller/blob/master/nodes/listener.py
The loss is decreasing like it should, so it looks good, even if it isn't :)
EDIT:
I tried to minimize the code to this; I hope that's OK:
NUM_STATES = 200+200+1024+1024 #200 degree angle_goal, 200 possible degrees the joint could move, 1024 force values, two times
NUM_ACTIONS = 9 #3^2=9 ,one stop-state, one different speed left, one diff.speed right, two servos
session = tf.Session()
build_reward_state()
state = tf.placeholder("float", [None, NUM_STATES])
action = tf.placeholder("float", [None, NUM_ACTIONS])
target = tf.placeholder("float", [None])
Weights = tf.Variable(tf.truncated_normal([NUM_STATES, NUM_ACTIONS], mean=0.1, stddev=0.02, dtype=tf.float32, seed=1), name="Weights")
biases = tf.Variable(tf.zeros([NUM_ACTIONS]), name="biases")
output = tf.matmul(state, Weights) + biases
output1 = tf.nn.relu(output)
readout_action = tf.reduce_sum(tf.mul(output1, action), reduction_indices=1)
loss = tf.reduce_mean(tf.square(target - readout_action))
train_operation = tf.train.AdamOptimizer(0.1).minimize(loss)
session.run(tf.initialize_all_variables())
while 1 == 1:
    if a == 0:
        # a==0 is only run once at the beginning, then only a==1,2,3 are running
        state_from_env = get_current_state()   # we get an array of (1,2448)
        last_action = [0,0,1,0,0,0,0,0,0]      # "do nothing" action, array (1,9)
        a = 1
    if a == 1:
        # get random action or learned action, array of (1,9)
        # run this action (move servo motors)
        # save action in last_action
        a = 2
    if a == 2:
        # stop servo motors (so the movements are NOT continuous)
        a = 3
    if a == 3:
        current_state = get_current_state()    # array of (1,2448)
        # get reward (one value)
        observations.append((last_state, last_action, reward, current_state))
        if training_time:
            # get random sample from observations
            agents_reward_per_action = session.run(output, feed_dict={state: current_states})
            agents_expected_reward.append(rewards[i] + FUTURE_REWARD_DISCOUNT * np.max(agents_reward_per_action[i]))
            _, result = session.run([train_operation, merged],
                                    feed_dict={state: previous_states, action: actions, target: agents_expected_reward})
        # update values
        last_state = current_state
        a = 1
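Not part of the original post, but one thing worth checking (a sketch using the same TF 0.x-era summary calls as the rest of the snippet, assuming merged above is built from tf.merge_all_summaries()): log gradient histograms next to the weight histograms, in place of the single minimize() call, so TensorBoard shows which parts of Weights actually receive a gradient.
optimizer = tf.train.AdamOptimizer(0.1)
grads_and_vars = optimizer.compute_gradients(loss, var_list=[Weights, biases])
for grad, var in grads_and_vars:
    # One histogram per variable, tagged e.g. "Weights/gradient"
    tf.histogram_summary(var.op.name + "/gradient", grad)
train_operation = optimizer.apply_gradients(grads_and_vars)
merged = tf.merge_all_summaries()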