Reinforcement learning a3c with multiple independent outputs - tensorflow

I am attempting to modify and implement Google's pattern of the Asynchronous Advantage Actor Critic (A3C) model. There are plenty of examples online that have gotten me started, but I am running into issues when attempting to expand the samples.
All of the examples I can find focus on Pong, which has a state-based output of left, right, or stay still. What I am trying to expand this to is a system that also has a separate on/off output. In the context of Pong, it would be a boost to your speed.
The code I am basing my code on can be found here. It is playing doom, but it still has the same left and right but also a fire button instead of stay still. I am looking at how I could modify this code such that fire was an independent action from movement.
I know I can easily add another separate output from the model so that the outputs would look something like this:
self.output = slim.fully_connected(rnn_out,a_size,
self.output2 = slim.fully_connected(rnn_out,1,
The thing I am struggling with is how I then have to modify the value output and redefine the loss function. Is the value still tied to the combination of the two outputs, or is there a separate value output for each of the independent outputs? I feel like there should still be only one value output, but I am unsure how I then use that one value and modify the loss function to take this into account.
I was thinking of adding a separate term to the loss function so that the calculation would look something like this:
self.actions_1 = tf.placeholder(shape=[None],dtype=tf.int32)
self.actions_2 = tf.placeholder(shape=[None],dtype=tf.float32)
self.actions_onehot = tf.one_hot(self.actions_1,a_size,dtype=tf.float32)
self.target_v = tf.placeholder(shape=[None],dtype=tf.float32)
self.advantages = tf.placeholder(shape=[None],dtype=tf.float32)
self.responsible_outputs = tf.reduce_sum(self.output1 * self.actions_onehot, [1])
self.responsible_outputs_2 = tf.reduce_sum(self.output2 * self.actions_2, [1])
#Loss functions
self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages) -
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
I am looking to know if I am on the right track here, or if there are resources or examples that I can expand off of.

First of all, the example you mention doesn't need two output nodes. One output node with a continuous output value is enough to solve it. Also, you shouldn't use a placeholder for the advantage; use one for the discounted reward instead and derive the advantage from it.
self.discounted_reward = tf.placeholder(shape=[None],dtype=tf.float32)
self.advantages = self.discounted_reward - self.value
Also, when calculating the policy loss, you have to wrap the advantage in tf.stop_gradient so that the policy loss does not send gradients back into the value node.
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*tf.stop_gradient(self.advantages))
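To make the roles of the discounted reward and the advantage concrete, here is a minimal plain-Python sketch of the arithmetic for one rollout. The helper names (discounted_returns, advantages, policy_loss), the discount factor, and the numbers are assumptions made for illustration and do not come from the code in the question. Plain floats carry no gradient, so using the advantages as numbers here plays the role that tf.stop_gradient plays in the graph.

```python
import math


def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Discounted return for every step of one rollout (hypothetical helper)."""
    returns = []
    running = bootstrap_value
    # Walk backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = discounted return minus the critic's value estimate."""
    return [g - v for g, v in zip(returns, values)]


def policy_loss(chosen_action_probs, advs):
    """Policy-gradient loss; advantages are plain constants here."""
    return -sum(math.log(p) * a for p, a in zip(chosen_action_probs, advs))


rewards = [1.0, 0.0, 2.0]   # illustrative rollout rewards
values = [0.5, 0.5, 0.5]    # illustrative critic outputs
returns = discounted_returns(rewards, gamma=0.5)
advs = advantages(returns, values)
loss = policy_loss([0.5, 0.25, 0.5], advs)
```

In the TensorFlow graph the same subtraction happens on tensors, which is why the advantage has to be detached explicitly before it multiplies the log-probabilities.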


How do I get value function/critic values from Rllib's PPO algorithm for a range of observations?

Goal: I want to train a PPO agent on a problem and determine its optimal value function for a range of observations. Later I plan to work with this value function (economic inequality research). The problem is sufficiently complex so that dynamic programming techniques no longer work.
Approach: In order to check whether I get correct outputs for the value function, I have trained PPO on a simple problem whose analytical solution is known. However, the results for the value function are rubbish, which is why I suspect that I have done something wrong.
The code:
from keras import backend as k_util
parser = argparse.ArgumentParser()
# Define framework to use
choices=["tf", "tf2", "tfe", "torch"],
help="The DL framework specifier.",
def get_rllib_config(seeds, debug=False, framework="tf") -> Dict:
def get_value_function(agent, min_state, max_state):
policy = agent.get_policy()
value_function = []
for i in np.arange(min_state, max_state, 1):
model_out, _ = policy.model({"obs": np.array([[i]], dtype=np.float32)})
value = k_util.eval(policy.model.value_function())[0]
print(i, value)
return value_function
def train_schedule(config, reporter):
rllib_config = config["config"]
iterations = rllib_config.pop("training_iteration", 10)
agent = PPOTrainer(env=rllib_config["env"], config=rllib_config)
for _ in range(iterations):
result = agent.train()
values = get_value_function(agent, 0, 100)
resources = PPO.default_resource_request(exp_config)
tune_analysis = tune.Tuner(tune.with_resources(train_schedule, resources=resources), param_space=exp_config).fit()
So first I get the policy (policy = agent.get_policy()) and run a forward pass with each of the 100 values (model_out, _ = policy.model({"obs": np.array([[i]], dtype=np.float32)})). Then, after each forward pass I use the value_function() method to get the output of the critic network and evaluate the tensor via keras backend.
The results:
[Figure: true VF (analytical solution)]
[Figure: VF output of RLlib]
Unfortunately you can see that the results are not that promising. Maybe I have missed a pre- or postprocessing step? Does the value_function() method even return the last layer of the critic network?
I am very grateful for any help!
It's not part of your script, but I assume that you have trained the policy before you attempt to get useful values out of it.
You are correct in assuming that the value_function() returns the output of the last layer of the critic network in RLlib's implementations.
Have a look at the value function metrics to see if it's actually learning anything (RLlib logs .../learner_stats/vf_loss and .../learner_stats/vf_explained_var)!
After training the model, I'd also try to query the model directly. If that looks better, something is likely off with the code you posted here.
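As a rough guide to reading a metric like vf_explained_var, the sketch below computes explained variance in its usual form, 1 - Var(target - prediction) / Var(target). This is an assumption about the general definition, not a copy of RLlib's code, which may differ in details such as clipping. The function name and sample numbers are made up for illustration.

```python
def explained_variance(targets, predictions):
    """1 - Var(targets - predictions) / Var(targets), population variances."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("need two non-empty sequences of equal length")

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    target_var = variance(targets)
    if target_var == 0.0:
        raise ValueError("targets are constant; explained variance is undefined")
    residuals = [t - p for t, p in zip(targets, predictions)]
    return 1.0 - variance(residuals) / target_var


value_targets = [1.0, 2.0, 3.0, 4.0]                            # illustrative returns
perfect = explained_variance(value_targets, value_targets)
mean_only = explained_variance(value_targets, [2.5, 2.5, 2.5, 2.5])
```

A value near 1 means the critic tracks the value targets well; a value near 0 (or below) means it does no better than predicting a constant, in which case the value function itself should not be expected to match the analytical solution.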

Generating a Plot of CV vs. Degrees of Freedom

I have a dataset (n=298), and I am currently working on a generalized additive model (GAM) for it. There are three predictor variables and one response variable. I used this code to fit the GAM and perform leave-one-out cross-validation:
ctrl <- trainControl(method = "LOOCV")
model <- train(response~ predictor1+ predictor2 + predictor3, data= data[2:5], method = "gam", trControl = ctrl)
While I think this worked in generating the model and performing cross validation, I'd like to graph the CV value over the degrees of freedom, similar to what is shown in the book image below. I'm not really sure how to go about this with my model as I am pretty new to using R.
[Figure: graph example]
I tried to use plot(model), but it just outputs the graph below, which isn't very helpful and certainly isn't what I'm looking for. Any advice on how to approach this would be greatly appreciated. Thanks.
[Figure: plot(model) graph]

Customized aggregation algorithm for gradient updates in tensorflow federated

I have been trying to implement this paper. Basically, what I want to do is sum the per-client losses and compare the sum with that of the previous epoch. Then, for each constituent layer of the model, I want to compare the KL divergence between the weights of the server model and the client model to get the layer-specific parameter updates, apply a softmax, and decide whether an adaptive update or the normal FedAvg approach is needed.
The algorithm is as follows-
I tried to make use of the code here to build a custom federated averaging process. I got the basic understanding that some tf.computations and some tff.computations are involved. I get that I need to change the orchestration logic in the run_one_round function and manipulate the client outputs to do adaptive averaging instead of vanilla federated averaging. The client_update tf.computation function returns all the values that I need, i.e. the weights_delta (which can be used for the client-based model weights) and the model_output (which can be used to calculate the loss).
But I am not sure where exactly I should make the changes.
def run_one_round(server_state, federated_dataset):
    server_message = tff.federated_map(server_message_fn, server_state)
    server_message_at_client = tff.federated_broadcast(server_message)
    client_outputs = tff.federated_map(
        client_update_fn, (federated_dataset, server_message_at_client))
    weight_denom = client_outputs.client_weight
    # todo
    # instead of using tff.federated_mean I wish to do a adaptive aggregation based on the client_outputs.weights_delta and server_state model
    round_model_delta = tff.federated_mean(
        client_outputs.weights_delta, weight=weight_denom)
    #client_outputs.weights_delta has all the client model weights.
    #client_outputs.client_weight has the number of examples per client.
    #client_outputs.model_output has the output of the model per client example.
I want to make use of the server model weights using server_state object.
I want to calculate the KL divergence between the weights of server model and each client's model per layer. Then use a relative weight to aggregate the client weights instead of vanilla federated averaging.
Instead of using tff.federated_mean I wish to use a different strategy basically an adaptive one based on the algorithm above.
So I needed some suggestions on how to go about implementing this.
Basically, what I want to do is:
1) Sum all the values of the client losses.
2) Calculate the KL divergence between each client and the server on a per-layer basis, and then determine whether to use adaptive optimization or FedAvg.
Also, is there a way to manipulate this value as a Python value, which would be helpful for debugging purposes? (I tried to use tf.print, but that was not helpful either.) Thanks!
Simplest option: compute weights for mean on clients
If I read the algorithm above correctly, we need only compute some weights for a mean on-the-fly. tff.federated_mean accepts an optional CLIENTS-placed weight argument, so probably the simplest option here is to compute the desired weights on the clients and pass them in to the mean.
This would look something like (assuming the appropriate definitions of the variables used below, which we will comment on):
def round_function(...):
    # We assume there is a tff.Computation training_fn that performs training,
    # and we're calling it here on the correct arguments
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    # Next we assume there is a variable in-scope server_model,
    # representing the 'current global model'.
    global_model_at_clients = tff.federated_broadcast(server_model)
    # Here we assume a function compute_kl_divergence, which takes
    # two structures of tensors and computes the KL divergence
    # (as a scalar) between them. The two arguments here are clients-placed,
    # so the result will be as well.
    kl_div_at_clients = tff.federated_map(compute_kl_divergence,
                                          (global_model_at_clients, trained_clients))
    # Perhaps we wish to not use raw KL divergence as the weight, but rather
    # some function thereof; if so, we map a postprocessing function to
    # the computed divergences. The result will still be clients-placed.
    mean_weight = tff.federated_map(postprocess_divergence, kl_div_at_clients)
    # Now we simply use the computed weights in the mean.
    return tff.federated_mean(trained_clients, weight=mean_weight)
More flexible tool: tff.federated_reduce
TFF generally encourages algorithm developers to implement whatever they can 'in the aggregation', and as such exposes some highly customizable primitives like tff.federated_reduce, which allow you to run arbitrary TensorFlow "in the stream" between clients and server. If the above reading of the desired algorithm is incorrect and something more involved is needed, or you wish to flexibly experiment with totally different notions of aggregation (something TFF encourages and is designed to support), this may be the option for you.
In TFF's heuristic typing language, tff.federated_reduce has signature:
<{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER
Meaning, federated_reduce takes a value of type T placed at the clients, a 'zero' in a reduction algebra of type U, and a function accepting a U and a T and producing a U, and applies this function 'in the stream' on the way between clients and server, producing a U placed at the server. The function (<U, T> -> U) will be applied to the partially accumulated value U and the 'next' element in the stream T (note, however, that TFF does not guarantee ordering of these values), returning another partially accumulated value U. The 'zero' should represent whatever 'partially accumulated' means over the empty set in your application; this will be the starting point of the reduction.
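The shape of that signature can be mimicked in plain Python with functools.reduce. The sketch below is only an analogy for the <{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER pattern and uses no TFF; federated_reduce_sim and accumulate are hypothetical names. Here T is a float, U is a (running_sum, count) pair, and the zero is what the accumulator means over no clients at all.

```python
from functools import reduce


def federated_reduce_sim(client_values, zero, op):
    """Plain-Python stand-in: fold `op` over the client values, starting at `zero`."""
    return reduce(op, client_values, zero)


def accumulate(partial, next_value):
    """(<U, T> -> U): fold one client value into the partial accumulator."""
    running_sum, count = partial
    return (running_sum + next_value, count + 1)


zero = (0.0, 0)                       # 'partially accumulated' over the empty set
client_values = [1.0, 2.0, 3.0, 6.0]  # illustrative per-client values of type T

total, count = federated_reduce_sim(client_values, zero, accumulate)
mean = total / count

# Arrival order is not guaranteed, so the op should not depend on it.
shuffled_result = federated_reduce_sim(list(reversed(client_values)), zero, accumulate)
```

Because the order of the elements is not guaranteed, an accumulator like this one, whose result is the same for any ordering, is the kind of reduction function that fits the primitive.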
Application to this problem
The components
Your reduction function needs access to two pieces of data: the global model state and the result of training on a given client. This maps quite nicely to the type T. In this application, we will have something like:
T = <server_model=server_model_type, trained_model=trained_model_type>
These two types are likely to be the same, but may not necessarily be so.
Your reduction function will accept the partial aggregate, your server model and your client-trained model, returning a new partial aggregate. Here we will start assuming the same reading of the algorithm as above, that of a weighted mean with particular weights. Generally, the easiest way to compute a mean is to keep two accumulators, one for numerator and one for denominator. This will affect the choice of zero and reduction function below.
Your zero should contain a structure of tensors with value 0 mapping to the weights of your model--this will be the numerator. This would be generated for you if you had an aggregation like tff.federated_sum (as TFF knows what the zero should be), but for this case you'll have to get your hands on such a tensor yourself. This shouldn't be too hard with tf.nest.map_structure and tf.zeros_like.
For the denominator, we will assume we just need a scalar. TFF and TF are much more flexible than this--you could keep a per-layer or per-parameter denominator if desired--but for simplicity we will assume that we just want to divide by a single float in the end.
Therefore our type U will be something like:
U = <numerator=server_model_type, denominator=tf.float32>
Finally we come to our reduction function. It will be more or less a different composition of the same pieces above; we will make slightly tighter assumptions about them here (in particular, that all the local functions are tff.tf_computations, a technical assumption that is arguably a bug in TFF). Our reduction function will be along these lines (assuming appropriate type aliases):
@tff.tf_computation(U, T)
def reduction(partial_accumulate, next_element):
    kl_div = compute_kl_divergence(
        next_element.server_model, next_element.trained_model)
    weight = postprocess_divergence(kl_div)
    # Add the weighted client model to the numerator leaf by leaf; this
    # assumes the numerator and the trained model share the same structure.
    new_numerator = tf.nest.map_structure(
        lambda acc, w: acc + weight * w,
        partial_accumulate.numerator, next_element.trained_model)
    new_denominator = partial_accumulate.denominator + weight
    return collections.OrderedDict(
        numerator=new_numerator, denominator=new_denominator)
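To check the numerator/denominator bookkeeping without any TFF machinery, here is a plain-Python simulation of the same reduction. Everything in it is an assumption made for illustration: models are dicts of layer name to a flat list of floats, compute_kl_divergence turns each layer into a distribution with a softmax before taking the KL divergence (one possible reading, not necessarily the paper's), and postprocess_divergence maps the divergence to a weight with exp(-kl).

```python
import math


def softmax(xs):
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compute_kl_divergence(server_model, trained_model):
    """Hypothetical: sum over layers of KL(softmax(server) || softmax(client))."""
    kl = 0.0
    for name in server_model:
        p = softmax(server_model[name])
        q = softmax(trained_model[name])
        # Terms with p_i == 0 contribute nothing; q_i is assumed non-zero.
        kl += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl


def postprocess_divergence(kl_div):
    """Hypothetical weighting: clients close to the server get weight near 1."""
    return math.exp(-kl_div)


def reduction(partial, next_element):
    server_model, trained_model = next_element
    weight = postprocess_divergence(
        compute_kl_divergence(server_model, trained_model))
    numerator = {
        name: [acc + weight * w
               for acc, w in zip(partial["numerator"][name], trained_model[name])]
        for name in trained_model
    }
    return {"numerator": numerator,
            "denominator": partial["denominator"] + weight}


def aggregate(server_model, client_models):
    # The zero: a structure of zeros shaped like the model, plus a scalar 0.
    acc = {"numerator": {n: [0.0] * len(ws) for n, ws in server_model.items()},
           "denominator": 0.0}
    for client_model in client_models:
        acc = reduction(acc, (server_model, client_model))
    return {n: [v / acc["denominator"] for v in vals]
            for n, vals in acc["numerator"].items()}


server = {"dense": [0.0, 1.0], "bias": [0.5]}
clients = [{"dense": [0.0, 1.0], "bias": [0.5]},
           {"dense": [2.0, -1.0], "bias": [1.5]}]
new_model = aggregate(server, clients)
```

With these toy numbers the second client is far from the server model, so it receives a small weight and the aggregate stays close to the first client. The final division by the denominator is the step that happens on the server after the reduction returns.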
Putting them together
The basic outline of a round will be similar to the above, but we have put more computation 'in the stream', and consequently there will be less on the clients. We assume here the same variable definitions.
def round_function(...):
    trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
    global_model_at_clients = tff.federated_broadcast(server_model)
    # This zip I believe is not necessary, but it helps my mental model.
    reduction_arg = tff.federated_zip(
    # We assume a zero as specified above
    return tff.federated_reduce(reduction_arg,

Unsure whether function breaks backpropagation

I have been tinkering a lot with TensorFlow in the past few days; however, I am quite unsure whether a function I wrote would break backpropagation in a neural network. I thought I'd ask here before I try to integrate this function into a NN. The basic setup is that I want to add two matrices with
op = tf.add(tfObject, tfImageBackground)
where tfImageBackground is some constant image (i.e. an RGBA image of size 800 x 800 with R = G = B = A = 0) and tfObject is again a matrix with the same dimensions. However, we get tfObject from the function I am unsure about:
def getObject(vector):
objectId = vector[0]
x = vector[1]
y = vector[2]
xEnd = baseImageSize-(x+objectSize)
yStart =baseImageSize- (y+objectSize)
padding = tf.convert_to_tensor([[x, xEnd], [yStart, y],[0,0]])
RTensor = tfObjectMatrix[objectId,:,:,0:1]
GTensor = tfObjectMatrix[objectId,:,:,1:2]
BTensor = tfObjectMatrix[objectId,:,:,2:3]
ATensor = tfObjectMatrix[objectId,:,:,3:4]
paddedR = tf.pad(tensor = RTensor,
paddings= padding,
generates padding for every channel
finalTensor=tf.concat([paddedR, paddedG, paddedB, paddedA], 2)
return finalTensor
The tfObjectMatrix is a list of images which never change.
I did check whether I was able to generate a tf.gradient from the op, which turned out to work. I am unsure if that is sufficient for backpropagation to work, though.
Thanks for your time and effort. Any input at all would be greatly appreciated.
TensorFlow will backpropagate to everything by default. As per your code, everything will receive gradients with a training operation from an optimizer. So to answer your question, backpropagation will work.
The only thing to consider is that you say tfObjectMatrix is a list of images that will not change, so you might not want it to receive any gradients. Therefore you might want to look into tf.stop_gradient() and maybe use it like OM = tf.stop_gradient( tfObjectMatrix ) and work with that OM in your function.

Different Evaluation of Seemingly Same Tensors

In my program I have:
run_plain = neural_network_model(x)
run_max = tf.argmax(run_plain, 1)
run_network = tf.argmax(neural_network_model(x), 1)
run_max and run_network give me different outputs when executed with the same input, e.g. via run_max.eval({x:[test_x[i]]}).
Is there something fundamental about how TensorFlow's eval() works that I am misunderstanding? In my opinion the results should be the same. Or is there some other error in my code?
Could you post your whole example?
Otherwise based on what has been given there should be no difference between the two examples.