Can a tf-agents environment be defined with an unobservable exogenous state? - tensorflow

I apologize in advance for the question in the title not being very clear. I'm trying to train a reinforcement learning policy using tf-agents in which there exists some unobservable stochastic variable that affects the state.
For example, consider the standard CartPole problem, but we add wind where the velocity changes over time. I don't want to train an agent that relies on having observed the wind velocity at each step; I instead want the wind to affect the position and angular velocity of the pole, and the agent to learn to adapt just as it would in the wind-free environment. In this example however, we would need the wind velocity at the current time to be correlated with the wind velocity at the previous time e.g. we wouldn't want the wind velocity to change from 10m/s at time t to -10m/s at time t+1.
The problem I'm trying to solve is how to track the state of the exogenous variable without making it part of the observation spec that gets fed into the neural network when training the agent. Any guidance would be appreciated.

Yes, that is no problem at all. Your environment object (a subclass of PyEnvironment or TFEnvironment) can do whatever you want within it. The observation_spec requirement is only related to the TimeStep that you output in the step and reset methods (more precisely in your implementation of the _step and _reset abstract methods).
Your environment however is completely free to have any additional attributes that you might want (like parameters to control wind generation) and any number of additional methods you like (like methods to generate the wind at this timestep according to self._wind_hyper_params). A quick schematic of your code would look like is below:
class MyCartPole(PyEnvironment):
def __init__(self, ..., wind_HP):
... # self._observation_spec and _action_spec can be left unchanged
self._wind_hyper_params = wind_HP
self._wind_velocity = 0
self._state = ...
def _update_wind_velocity(self):
self._wind_velocity = ...
def factor_in_wind(self):
self.state = ... #update according to wind
def _step(self, action):
... # self._state update computations
self._update_wind_velocity
self.factor_in_wind()
observations = self._state_to_observations()
...

Related

Customized aggregation algorithm for gradient updates in tensorflow federated

I have been trying to implement this paper . Basically what I want to do is sum the per client loss and compare the same with previous epoch. Then for each constituent layer of the model compare the KL divergence between the weights of the server and the client model to get the layer specific parameter updates and then doing a softmax and to decide whether an adaptive update or a normal FedAvg approach is needed.
The algorithm is as follows-
FedMed
I tried to make use of the code here to build a custom federated avg process. I got the basic understanding that there are some tf.computations and some tff.computations which are involved. I get that I need to make changes in the orchestration logic in the run_one_round function and basically manipulate the client outputs to do adaptive averaging instead of the vanilla federated averaging. The client_update tf.computation function basically returns all the values that I need i.e the weights_delta (can be used for client based model weights), model_output(which can be used to calculate the loss).
But I am not sure where exactly I should make the changes.
#tff.federated_computation(federated_server_state_type,
federated_dataset_type)
def run_one_round(server_state, federated_dataset):
server_message = tff.federated_map(server_message_fn, server_state)
server_message_at_client = tff.federated_broadcast(server_message)
client_outputs = tff.federated_map(
client_update_fn, (federated_dataset, server_message_at_client))
weight_denom = client_outputs.client_weight
# todo
# instead of using tff.federated_mean I wish to do a adaptive aggregation based on the client_outputs.weights_delta and server_state model
round_model_delta = tff.federated_mean(
client_outputs.weights_delta, weight=weight_denom)
#client_outputs.weights_delta has all the client model weights.
#client_outputs.client_weight has the number of examples per client.
#client_outputs.model_output has the output of the model per client example.
I want to make use of the server model weights using server_state object.
I want to calculate the KL divergence between the weights of server model and each client's model per layer. Then use a relative weight to aggregate the client weights instead of vanilla federated averaging.
Instead of using tff.federated_mean I wish to use a different strategy basically an adaptive one based on the algorithm above.
So I needed some suggestions on how to go about implementing this.
Basically what I want to do is :
1)Sum all the values of client losses.
2)Calculate the KL divergence per layerbasis of all the clients with server and then determine whether to use adaptive optimization or FedAvg.
Also is there a way to manipulate this value as a python value which will be helpful for debugging purposes( I tried to use tf.print but that was not helpful either). Thanks!
Simplest option: compute weights for mean on clients
If I read the algorithm above correctly, we need only compute some weights for a mean on-the-fly. tff.federated_mean accepts an optional CLIENTS-placed weight argument, so probably the simplest option here is to compute the desired weights on the clients and pass them in to the mean.
This would look something like (assuming the appropriate definitions of the variables used below, which we will comment on):
#tff.federated_computation(...)
def round_function(...):
...
# We assume there is a tff.Computation training_fn that performs training,
# and we're calling it here on the correct arguments
trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
# Next we assume there is a variable in-scope server_model,
# representing the 'current global model'.
global_model_at_clients = tff.federated_broadcast(server_model)
# Here we assume a function compute_kl_divergence, which takes
# two structures of tensors and computes the KL divergence
# (as a scalar) between them. The two arguments here are clients-placed,
# so the result will be as well.
kl_div_at_clients = tff.federated_map(compute_kl_divergence,
(global_model_at_clients, trained_clients))
# Perhaps we wish to not use raw KL divergence as the weight, but rather
# some function thereof; if so, we map a postprocessing function to
# the computed divergences. The result will still be clients-placed.
mean_weight = tff.federated_map(postprocess_divergence, kl_div_at_clients)
# Now we simply use the computed weights in the mean.
return tff.federated_mean(trained_clients, weight=mean_weight)
More flexible tool: tff.federated_reduce
TFF generally encourages algorithm developers to implement whatever they can 'in the aggregation', and as such exposes some highly customizable primitives like tff.federated_reduce, which allow you to run arbitrary TensorFlow "in the stream" between clients and server. If the above reading of the desired algorithm is incorrect and something more involved is needed, or you wish to flexibly experiment with totally different notions of aggregation (something TFF encourages and is designed to support), this may be the option for you.
In TFF's heuristic typing language, tff.federated_reduce has signature:
<{T}#CLIENTS, U, (<U, T> -> U)> -> U#SERVER
Meaning, federated_reduce take a value of type T placed at the clients, a 'zero' in a reduction algebra of type U, and a function accepting a U and a T and producing a U, and applies this function 'in the stream' on the way between clients and server, producing a U placed at the server. The function (<U, T> -> U) will be applied to the partially accumulated value U, and the 'next' element in the stream T (note however that TFF does not guarantee ordering of these values), returning another partially accumulated value U. The 'zero' should represent whatever 'partially accumulated' means over the empty set in your application; this will be the starting point of the reduction.
Application to this problem
The components
Your reduction function needs access to two pieces of data: the global model state and the result of training on a given client. This maps quite nicely to the type T. In this application, we will have something like:
T = <server_model=server_model_type, trained_model=trained_model_type>
These two types are likely to be the same, but may not necessarily be so.
Your reduction function will accept the partial aggregate, your server model and your client-trained model, returning a new partial aggregate. Here we will start assuming the same reading of the algorithm as above, that of a weighted mean with particular weights. Generally, the easiest way to compute a mean is to keep two accumulators, one for numerator and one for denominator. This will affect the choice of zero and reduction function below.
Your zero should contain a structure of tensors with value 0 mapping to the weights of your model--this will be the numerator. This would be generated for you if you had an aggregation like tff.federated_sum (as TFF knows what the zero should be), but for this case you'll have to get your hands on such a tensor yourself. This shouldn't be too hard with tf.nest.map_structure and tf.zeros_like.
For the denominator, we will assume we just need a scalar. TFF and TF are much more flexible than this--you could keep a per-layer or per-parameter denominator if desired--but for simplicity we will assume that we just want to divide by a single float in the end.
Therefore our type U will be something like:
U = <numerator=server_model_type, denominator=tf.float32>
Finally we come to our reduction function. It will be more or less a different composition of the same pieces above; we will make slightly tighter assumptions about them here (in particular, that all the local functions are tff.tf_computations--a technical assumption, arguably a bug on TFF). Our reduction function will be along the lines (assuming appropriate type aliases):
#tff.tf_computation(U, T)
def reduction(partial_accumulate, next_element):
kl_div = compute_kl_divergence(
next_element.server_model, next_element.trained_model)
weight = postprocess_divergence(kl_div)
new_numerator = partial_accumulate.numerator + weight * next_element.trained_model
new_denominator = partial_accumulate.denominator + weight
return collections.OrderedDict(
numerator=new_numerator, denominator=new_denominator)
Putting them together
The basic outline of a round will be similar to the above; but we have put more computation 'in the stream', and consequently there wil be less on the clients. We assume here the same variable definitions.
#tff.federated_computation(...)
def round_function(...):
...
trained_clients = tff.federated_map(training_fn, clients_placed_arguments)
global_model_at_clients = tff.federated_broadcast(server_model)
# This zip I believe is not necessary, but it helps my mental model.
reduction_arg = tff.federated_zip(
collections.OrderedDict(server_model=global_model_at_clients,
trained_model=trained_clients))
# We assume a zero as specified above
return tff.federated_reduce(reduction_arg,
zero,
reduction)

Learning parameters of each simulated device

Does tensorflow-federated support assigning different hyper-parameters(like batch-size or learning rate) for different simulated devices?
Currently, you may find this a bit unnatural, but yes, such a thing is possible.
One approach to doing this that is supported today is to have each client take its local learning rate as a top-level parameter, and use this in the training. A dummy example here would be (sliding the model parameter in the computations below) something along the lines of
#tff.tf_computation(tff.SequenceTyoe(...), tf.float32)
def train_with_learning_rate(ds, lr):
# run training with `tf.data.Dataset` ds and learning rate lr
...
#tff.federated_computation(tff.FederatedType([tff.SequenceType(...), tf.float32])
def run_one_round(datasets_and_lrs):
return tff.federated_mean(
tff.federated_map(train_with_learning_rate, datasets_and_lrs))
Invoking the federated computation here with a list of tuples with the first element of the tuple representing the clients data and the second element representing the particular client's learning rate, would give what you want.
Such a thing requires writing custom federated computations, and in particular likely defining your own IterativeProcess. A similar iterative process definition was recently open sourced here, link goes to the relevant local client function definition to allow for learning rate scheduling on the clients by taking an extra integer parameter representing the round number, it is likely a good place to look.

How to penalize a predictor variable to decrease its feature importance

I have one predictor which is dominating my models, I still want to include it, but I want to down-weight its importance in the final model. Is there a good (sci)pythonic way to do this? I'm thinking maybe defining a custom PenaltyTransformer which introduces random noise into the variable, something like this:
class PenaltyTransformer(BaseEstimator,TransformerMixin):
def __init__(self, columns, scale=0.1):
self.scale = scale
self.columns = columns
def transform(self, X):
X[:,self.columns] += np.random.normal(loc=0, scale=self.scale, size=X[:,self.columns].shape)
return X
... does this make sense?
Without knowing anything about your application, it's hard to give a definitive answer on what you should do. A few options I can see:
Your noising approach in the question might be fine
You might use a model with higher bias and lower variance like a regularized linear model (I'm assuming you're doing GBM, RF, or something similar from the use of the term "importance")
You might exclude the highly predictive feature entirely
You might build a model that excludes the highly predictive feature entirely and then combines the resulting score with that feature in some way
You might also just want to accept that the strong feature is going to be the dominant factor in your model

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open source version of a3c implementation in Tensorflow which works reasonably well for atari 2600 experiments. However, when I modify the network for Mujoco, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open source implementations of a3c work with continuous domain problems, for example mujoco?
I have done a continuous action of Pendulum and it works well.
Firstly, you will build your neural network and output mean (mu) and standard deviation (sigma) for selecting an action.
The essential part of the continuous action is to include a normal distribution. I'm using tensorflow, so the code is looks like:
normal_dist = tf.contrib.distributions.Normal(mu, sigma)
log_prob = normal_dist.log_prob(action)
exp_v = log_prob * td_error
entropy = normal_dist.entropy() # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v
When you wanna sample an action, use the function tensorflow gives:
sampled_action = normal_dist.sample(1)
The full code of Pendulum can be found in my Github. https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.

Learning rate doesn't change for AdamOptimizer in TensorFlow

I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).
Here is a code snippet from what I have so far:
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
sess.run(tf.initialize_all_variables())
for i in range(0, 10000):
sess.run(train_op)
print sess.run(optimizer._lr_t)
If I run the code I constantly get the initial learning rate (1e-3) i.e. I see no change.
What is a correct way for getting the learning rate at every step?
I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.
I was asking myself the exact same question, and wondering why wouldn't it change. By looking at the original paper (page 2), one sees that the self._lr stepsize (designed with alpha in the paper) is required by the algorithm, but never updated. We also see that there is an alpha_t that is updated for every t step, and should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the value for the self._lr_t tensor at any point during the training returns always the initial value, that is, _lr.
So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:
alpha_t = alpha * sqrt(1-beta_2_t) / (1-beta_1_t)
BACKGROUND
As you observed, the _lr_t tensor doesn't change thorough the training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].
By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
Further inspecting the optimizer's code, in line 211, we see the following update:
update_beta1 = self._beta1_power.assign(
self._beta1_power * self._beta1_t,
use_locking=self._use_locking)
Which basically means that _beta1_power, which is initialized with _beta1, will be multiplied by _beta_1_t after every iteration, which is also initialized with beta_1_t.
But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and lr_t do. I guess this is for a reason but I personally don't know why, in any case this are protected/private attributes of the implementation (as they start with an underscore) and don't belong to the public interface (they may even change among TF versions).
So after this small background we can see that _beta_1_power and _beta_2_power are the original beta values exponentiated to the current training step, that is, the equivalent to the variables referred with beta_tin the paper. Going back to the definition of alpha_t in the section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
SOLUTION
optimizer = tf.train.AdamOptimizer()
# rest of the graph...
# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0* (1-bb2)**0.5 /(1-bb1)
print(at)
The variable at holds the alpha_t for the current training step.
DISCLAIMER
I couldn't find a cleaner way of getting this value by just using the optimizer's interface, but please let me know if it exists one! I guess there is none, which actually puts into question the usefulness of plotting alpha_t, since it does not depend on the data.
Also, to complete this information, section 2 of the paper also gives the formula for the weight updates, which is much more telling, but also more plot-intensive. For a very nice and good-looking implementation of that, you may want to take a look at this nice answer from the post that you linked.
Hope it helps! Cheers,
Andres