Different optimization behavior using np.random.normal instead of tf.random_normal

I'm looking into the code from https://github.com/AshishBora/csgm and am seeing some strange behavior when using np.random.normal instead of tf.random_normal as the initializer of a tf.Variable. More concretely:
Instead of
z = tf.Variable(tf.random_normal((batch_size, hparams.n_z)), name='z')
I have
# in mnist_vae/src/model_def.py, line 74
z = tf.Variable(np.random.normal(size=(batch_size, hparams.n_z)).astype('float32'),
                name='z')
z is the variable, which is optimized via Adam optimizer with respect to an objective.
For a little bit of background: there is a pre-trained neural network G whose input z is drawn from a standard normal distribution using tf.random_normal. For a given z*, one wants to solve ẑ = argmin_z ||AG(z) - AG(z*)|| and check the reconstruction error ||G(ẑ) - G(z*)||. The resulting minimal value c(z*) = ||G(ẑ) - G(z*)|| is quite stable around a value c1 for several different z*. Now, I wasn't quite sure whether the optimization (Adam optimizer) might use the information that z comes from a standard normal distribution, so I replaced tf.random_normal with np.random.normal in the hope that the optimizer then can't use that information (see the code above).
Unfortunately, the results are indeed different with np.random.normal: c(z*) = ||G(ẑ) - G(z*)|| is, for several different z*, stable around a different value c2 (not c1). How can one explain this? Is it really the case that the optimizer uses the information about the normal distribution (e.g. as a log-likelihood prior) in the optimization? My feeling says no, since it's only the initialization.
The code is given in https://github.com/AshishBora/csgm
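For concreteness, here is a minimal sketch of the setup described above. Everything in it (G, A, z_star, batch_size, n_z, learning_rate) is a placeholder assumption, not the actual csgm code:

import numpy as np
import tensorflow as tf

# z is the only trainable variable; G and A are assumed to be fixed, pre-trained/known mappings
z = tf.Variable(np.random.normal(size=(batch_size, n_z)).astype('float32'), name='z')

measurement_loss = tf.reduce_sum(tf.square(A(G(z)) - A(G(z_star))))  # ||A G(z) - A G(z*)||^2
recon_error = tf.reduce_sum(tf.square(G(z) - G(z_star)))             # ||G(ẑ) - G(z*)||^2, checked after optimization

train_op = tf.train.AdamOptimizer(learning_rate).minimize(measurement_loss, var_list=[z])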


Multiple-input multiple-output CNN with custom loss function

I have a set of 2D input arrays of size m x n, namely A, B, and C, and I have to predict two 2D output arrays, namely d and e, for which I do have the expected values. You can think of the inputs/outputs as grey images if you like.
Because the spatial information is relevant (these are actually 2D physical domains), I want to use a Convolutional Neural Network to predict d and e. My design (not tested yet) looks as follows:
Because I have multiple inputs, I guess I should use multiple columns (or branches) to find different features for each of the inputs (they look fairly different). Each of these columns follows an encoding-decoding architecture as used in segmentation (see SegNet): a Conv2D block involves a convolution + batch normalisation + ReLU layer, and a Deconv2D block involves a deconvolution + batch normalisation + ReLU.
Then I can merge the output of each column by, for example, concatenating, averaging or taking the maximum. To obtain the original m x n shape for each of the outputs, I have seen that this can be done with a 1 x 1 kernel convolution.
I want to predict the two outputs from that single layer. Is that okay from the network structure point of view? Finally, my loss function depends on the outputs themselves compared to the targets, plus another relation I want to impose.
I would like to have some expert opinion on this since this is my first design of a CNN and I am not sure if it makes sense as it is now and/or if there are better approaches (or network architectures) to this problem.
I posted this originally on datascience but did not get much feedback. I am now posting it here since there is a bigger community on these topics, and I would also be very grateful to receive implementation tips besides the architectural ones. Thanks.
I think your design makes sense in general:
since A, B, and C are fairly different, you give each input its own transform sub-network and then fuse them together, which is your intermediate representation.
from the intermediate representation, you apply additional CNNs to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should be based on whether the values in A, B, and C have different meanings or not. For example, if A is a grey-scale image, B is a depth map, and C is also a grey-scale image captured by a different camera, then A and B are better processed in your suggested way, but A and C can be concatenated as one single input before feeding it to your network.
D and E are two outputs of the network and will be trained in a multi-task manner. Of course, they should share some latent feature, and one should split at this feature to apply a down-stream, non-shared-weight branch for each output. However, where to split is usually tricky. (A rough sketch of such a shared-trunk, two-head model follows below.)
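As a rough sketch of that shared-trunk, two-head idea, here is what it could look like with tf.keras; the input shapes, channel counts and losses are placeholder assumptions, not a tuned design:

import tensorflow as tf
from tensorflow import keras

inp_a = keras.Input(shape=(64, 64, 1))   # input A
inp_b = keras.Input(shape=(64, 64, 1))   # input B, processed in its own branch

def branch(x):
    # small per-input transform sub-network
    x = keras.layers.Conv2D(16, 3, padding='same', activation='relu')(x)
    return keras.layers.Conv2D(16, 3, padding='same', activation='relu')(x)

fused = keras.layers.Concatenate()([branch(inp_a), branch(inp_b)])  # shared latent feature

out_d = keras.layers.Conv2D(1, 1, name='d')(fused)  # non-shared 1x1 head for D
out_e = keras.layers.Conv2D(1, 1, name='e')(fused)  # non-shared 1x1 head for E

model = keras.Model([inp_a, inp_b], [out_d, out_e])
model.compile(optimizer='adam', loss={'d': 'mse', 'e': 'mse'})  # per-output losses; extra terms can be added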
It is really a broad question, asking for answers relying mostly on opinions. Here are my two cents though, which you might find interesting as it does not go along the previous answers here and on datascience.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always the case that the network is some sort of Siamese network and the columns share the same weights, or at least the columns all need to produce a similar code. That is not your case here, so I would simply not bother.
Second, you are blessed with a problem with a dense output and no need to learn a code. This should direct you straight to U-nets, which outperform any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation, but they shine at any dense-output problem really.
In short, just stack your inputs together and use a U-net.
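To make that concrete, here is a very small U-net-style sketch of "stack the inputs and use a U-net"; the depths, channel counts and shapes are assumptions, not a tuned architecture:

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(64, 64, 3))          # A, B, C stacked as three channels

c1 = keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
p1 = keras.layers.MaxPooling2D()(c1)             # downsample

c2 = keras.layers.Conv2D(32, 3, padding='same', activation='relu')(p1)

u1 = keras.layers.UpSampling2D()(c2)             # upsample back to the input resolution
u1 = keras.layers.Concatenate()([u1, c1])        # skip connection: the "U" part
c3 = keras.layers.Conv2D(16, 3, padding='same', activation='relu')(u1)

outputs = keras.layers.Conv2D(2, 1)(c3)          # two output channels: d and e
model = keras.Model(inputs, outputs)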

Does TensorFlow gradient compute derivative of functions with unknown dependency on decision variable

I appreciate if you can answer my questions or provide me with useful resources.
Currently, I am working on a problem where I need to do alternating optimization. So, consider we have two decision variables x and y. In the first step I take the derivative of the loss function wrt. x (for fixed y) and update x. In the second step, I need to take the derivative wrt. y. The issue is that x depends on y implicitly, and finding a closed form of the cost function that shows the dependency of x on y is not feasible, so the gradients of the cost function wrt. y are unknown.
1) My first question is whether the reverse-mode "autodiff" method used in TensorFlow works for such problems, where we do not have an explicit form of the cost function wrt one variable and we need the derivatives. The value of the cost function is known, but its dependency on the decision variable is not available in closed form.
2) From a general view, if I define a node as a "tf.Variable" and have an arbitrary intractable function(intractable via computation by hand) of that variable that evolves through code execution, is it possible to calculate the gradients via "tf.gradients"? If yes, how can I make sure that it is implemented correctly? Can I check it using TensorBoard?
My model is too complicated, but a simplified form can be considered in this way: suppose the loss function for my model is L(x). I can code L(x) as a function of "x" during the construction phase in TensorFlow. However, I also have another variable "k" that is initialized to zero. The dependency of L(x) on "k" takes shape as the code runs, so my loss function is actually L(x,k). And more importantly, "x" is a function of "k" implicitly. (All the optimization is done using GradientDescent.) The problem is that I do not have L(x,k) as a closed-form function, but I do have the value of L(x,k) at each step. I can use "numerical" methods like FDSA/SPSA, but they are not exact. I just need to make sure, as you said, that there is a path between "k" and L(x,k), but I do not know how!
TensorFlow gradients only work when the graph connecting x and y (when you're computing dy/dx) has at least one path which contains only differentiable operations. In general, if TF gives you a gradient it is correct (otherwise file a bug, but gradient bugs are rare, since the gradient for all differentiable ops is well tested and the chain rule is fairly easy to apply).
Can you be a little more specific about what your model looks like? You might also want to use eager execution if your forward computation is too weird to express as a fixed dataflow graph.
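One hedged way to check whether such a path exists (a small sketch for TF 1.x graph mode, with toy variables standing in for your x, k and loss): tf.gradients returns None for a variable that has no differentiable path to the loss.

import tensorflow as tf

x = tf.Variable(1.0, name='x')
k = tf.Variable(0.0, name='k')

loss_connected = tf.square(x * k - 2.0)   # depends on both x and k
loss_disconnected = tf.square(x - 2.0)    # no path from k to this loss

print(tf.gradients(loss_connected, [x, k]))   # two gradient tensors
print(tf.gradients(loss_disconnected, [k]))   # [None]: k is not connected to the loss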

Why is tf.transpose so important in a RNN?

I've been reading the docs to learn TensorFlow and have been struggling with when to use the following functions and what their purpose is.
tf.split()
tf.reshape()
tf.transpose()
My guess so far is that:
tf.split() is used because inputs must be a sequence.
tf.reshape() is used to make the shapes compatible (incorrect shapes tend to be a common problem / mistake for me). I used numpy for this before, but I'll probably stick to tf.reshape() now. I am not sure if there is a difference between the two.
tf.transpose() swaps the rows and columns, from my understanding. If I don't use tf.transpose(), my loss doesn't go down. If the parameter values are incorrect, the loss doesn't go down. So the purpose of me using tf.transpose() is so that my loss goes down and my predictions become more accurate.
This bothers me tremendously because I'm using tf.transpose() only because I have to, with no understanding of why it's such an important factor. I'm assuming that if it's not used correctly, the inputs and labels can end up in the wrong positions, making it impossible for the model to learn. If this is true, how can I go about using tf.transpose() so that I am not so reliant on figuring out the parameter values via trial and error?
Question
Why do I need tf.transpose()?
What is the purpose of tf.transpose()?
Answer
Why do I need tf.transpose()? I can't imagine why you would need it unless you coded your solution from the beginning to require it. For example, suppose I have 120 student records with 50 stats per student and I want to use that to try and make a linear association with their chance of taking 3 classes. I'd state it like so
c = r x m
r = records, a matrix with a shape of [120x50]
m = the induction matrix; it has a shape of [50x3]
c = the chance of all students taking one of three courses, a matrix with a shape of [120x3]
Now if instead of making m [50x3], we goofed and made m [3x50], then we'd have to transpose it before multiplication.
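A minimal sketch of that fix in TensorFlow (the shapes are the ones assumed in the example above):

import tensorflow as tf

r = tf.random_normal((120, 50))       # student records: 120 students x 50 stats
m_goofed = tf.random_normal((3, 50))  # induction matrix accidentally built as [3x50]

c = tf.matmul(r, tf.transpose(m_goofed))  # transpose fixes the shapes; c has shape [120x3]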
What is the purpose of tf.transpose()?
Sometimes you just need to swap rows and columns, like above. Wikipedia has a fantastic page on it. The transpose has some excellent properties for matrix math, for example how it interacts with products ((AB)^T = B^T A^T) and with the inverse ((A^-1)^T = (A^T)^-1).
Summary
I don't think I've ever used tf.transpose in any CNN I've written.

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open source version of a3c implementation in Tensorflow which works reasonably well for atari 2600 experiments. However, when I modify the network for Mujoco, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open source implementations of a3c work with continuous domain problems, for example mujoco?
I have implemented continuous-action A3C on Pendulum and it works well.
Firstly, you will build your neural network and output mean (mu) and standard deviation (sigma) for selecting an action.
The essential part of the continuous-action case is to include a normal distribution. I'm using TensorFlow, so the code looks like:
normal_dist = tf.contrib.distributions.Normal(mu, sigma)  # the policy is a Gaussian over actions
log_prob = normal_dist.log_prob(action)                   # log-likelihood of the action that was taken
exp_v = log_prob * td_error                               # policy-gradient term weighted by the TD error
entropy = normal_dist.entropy()                           # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v
When you want to sample an action, use the function TensorFlow provides:
sampled_action = normal_dist.sample(1)
The full code of Pendulum can be found in my Github. https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.
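As a rough sketch of such a policy head (not taken from either answer; features, action_dim and action_bound are placeholder assumptions):

import tensorflow as tf

mu = tf.layers.dense(features, action_dim, activation=tf.nn.tanh)         # per-feature mean
sigma = tf.layers.dense(features, action_dim, activation=tf.nn.softplus)  # per-feature std-dev (kept positive)

normal_dist = tf.contrib.distributions.Normal(mu, sigma)
sampled_action = tf.clip_by_value(normal_dist.sample(1),        # sample one action vector
                                  -action_bound, action_bound)  # keep it inside the valid action range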

Learning rate doesn't change for AdamOptimizer in TensorFlow

I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).
Here is a code snippet from what I have so far:
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
sess.run(tf.initialize_all_variables())
for i in range(0, 10000):
    sess.run(train_op)
    print sess.run(optimizer._lr_t)
If I run the code, I constantly get the initial learning rate (1e-3), i.e. I see no change.
What is the correct way of getting the learning rate at every step?
I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.
I was asking myself the exact same question, and wondering why it wouldn't change. By looking at the original paper (page 2), one sees that the self._lr stepsize (denoted alpha in the paper) is required by the algorithm, but never updated. We also see that there is an alpha_t that is updated at every step t, and should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the value of the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.
So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:
alpha_t = alpha * sqrt(1-beta_2_t) / (1-beta_1_t)
BACKGROUND
As you observed, the _lr_t tensor doesn't change throughout the training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].
By inspecting them during training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
Further inspecting the optimizer's code, in line 211, we see the following update:
update_beta1 = self._beta1_power.assign(
self._beta1_power * self._beta1_t,
use_locking=self._use_locking)
This basically means that _beta1_power, which is initialized with _beta1, gets multiplied by _beta1_t after every iteration (and _beta1_t itself is also initialized from _beta1).
But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1 and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and _lr_t do. I guess this is for a reason, but I personally don't know why; in any case these are protected/private attributes of the implementation (as they start with an underscore) and don't belong to the public interface (they may even change among TF versions).
So after this small background, we can see that _beta1_power and _beta2_power are the original beta values exponentiated to the current training step, that is, the equivalent of the variables referred to as beta^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
SOLUTION
optimizer = tf.train.AdamOptimizer()
# rest of the graph...
# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0 * (1 - bb2) ** 0.5 / (1 - bb1)
print(at)
The variable at holds the alpha_t for the current training step.
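Since the original question also asked about TensorBoard, a hedged variant of the same computation can be expressed as a graph tensor and logged with a regular scalar summary (this relies on the same private attributes, so the usual disclaimers apply):

effective_lr = optimizer._lr_t * tf.sqrt(1.0 - optimizer._beta2_power) / (1.0 - optimizer._beta1_power)
lr_summary = tf.summary.scalar('effective_learning_rate', effective_lr)
# merge and write lr_summary with your other summaries via a tf.summary.FileWriter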
DISCLAIMER
I couldn't find a cleaner way of getting this value by just using the optimizer's interface, but please let me know if one exists! I guess there is none, which actually puts into question the usefulness of plotting alpha_t, since it does not depend on the data.
Also, to complete this information, section 2 of the paper also gives the formula for the weight updates, which is much more telling, but also more involved to plot. For a very nice and good-looking implementation of that, you may want to take a look at this nice answer from the post that you linked.
Hope it helps! Cheers,
Andres