I'd like to enforce symmetry in the weights within a Variable. I really want an approximate circular symmetry. However, I could imagine either row or column enforced symmetry.
The goal is to reduce training time by reducing the number of free variables. I know my problem would like a symmetric array but I might want to include both symmetric and "free" variables. I am using conv2d now, so I believe I need to keep using it.
Here is a function that creates a kernel symmetric with respect to reflection over its center row:
def SymmetricKernels(height,width,in_channels,out_channels,name=None):
half_kernels = tf.Variable(initial_value=tf.random_normal([(height+1)//2,width,in_channels,out_channels]))
half_kernels_reversed = tf.reverse(half_kernels[:(height//2),:,:,:],[0])
kernels = tf.concat([half_kernels,half_kernels_reversed],axis=0,name=name)
return kernels
Usage example:
w = SymmetricKernels(5,5,1,1)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
w_ = sess.run(w)
w_[:,:,0,0]
# output:
# [[-1.299 -1.835 -1.188 0.093 -1.736]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-0.217 -0.802 -0.892 -0.229 1.383]
# [-1.426 -2.087 0.434 0.223 -0.65 ]
# [-1.299 -1.835 -1.188 0.093 -1.736]]
The idea is to use tf.Variable() to create only the upper half variables of the kernels (half_kernels), and then form the symmetric kernels as a concatenation of the upper half and its reflected version.
This idea can be extended to create also kernels with both left-right and up-down symmetries.
Another thing you can try is to tie the net's hands by convolving twice, reusing the kernel but flipping it for the second convolution (untested code):
def symmetric_convolution(input_tensor, n_filters, size, name, dilations=[1,1,1,1]):
with tf.variable_scope("", reuse=tf.AUTO_REUSE):
kernel = tf.get_variable(shape=[*size, input_tensor.shape[-1], n_filters], name='conv_kernel_' + name, ...)
lr_flipped_kernel = tf.reverse(kernel, axis=[1], name='conv_kernel_flipped_lr_' + name)
conv_l = tf.nn.conv2d(input=input_tensor, filter=kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
conv_r = tf.nn.conv2d(input=input_tensor, filter=lr_flipped_kernel, strides=[1, 1, 1, 1], padding='SAME', dilations=dilations)
return tf.reduce_max(tf.concat([conv_l, conv_r], axis=-1), keepdims=True, axis=[-1])
You can add in biases, activations, etc. as needed. I've used something similar in the past – reduce_max will allow your kernel to take whatever shape, and effectively give you two convolutions for one; if you use reduce_sum instead, any asymmetries will average out quite quickly and your kernel will be symmetric. What works best will depend on your use case.
Related
I am using tf.GradientTape().gradient() to compute a representer point, which can be used to compute the "influence" of a given training example on a given test example. A representer point for a given test example x_t and training example x_i is computed as the dot product of their feature representations, f_t and f_i, multiplied by a weight alpha_i.
Note: The details of this approach are not necessary for understanding the question, since the main issue is getting gradient tape to work. That being said, I have included a screenshot of the some of the details below for anyone who is interested.
Computing alpha_i requires differentiation, since it is expressed as the following:
In the equation above L is the standard loss function (categorical cross-entropy for multiclass classification) and phi is the pre-softmax activation output (so its length is the number of classes). Furthermore alpha_i can be further broken up into alpha_ij, which is computed with respect to a specific class j. Therefore, we just obtain the pre-softmax output phi_j corresponding to the predicted class of the test example (class with highest final prediction).
I have created a simple setup with MNIST and have implemented the following:
def simple_mnist_cnn(input_shape = (28,28,1)):
input = Input(shape=input_shape)
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(input)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Conv2D(64, kernel_size=(3, 3), activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Flatten()(x) # feature representation
output = layers.Dense(num_classes, activation=None)(x) # presoftmax activation output
activation = layers.Activation(activation='softmax')(output) # final output with activation
model = tf.keras.Model(input, [x, output, activation], name="mnist_model")
return model
Now assume the model is trained, and I want to compute the influence of a given train example on a given test example's prediction, perhaps for model understanding/debugging purposes.
with tf.GradientTape() as t1:
f_t, _, pred_t = model(x_t) # get features for misclassified example
f_i, presoftmax_i, pred_i = model(x_i)
# compute dot product of feature representations for x_t and x_i
dotps = tf.reduce_sum(
tf.multiply(f_t, f_i))
# get presoftmax output corresponding to highest predicted class of x_t
phi_ij = presoftmax_i[:,np.argmax(pred_t)]
# y_i is actual label for x_i
cl_loss_i = tf.keras.losses.categorical_crossentropy(pred_i, y_i)
alpha_ij = t1.gradient(cl_loss_i, phi_ij)
# note: alpha_ij returns None currently
k_ij = tf.reduce_sum(tf.multiply(alpha_i, dotps))
The code above gives the following error, since alpha_ij is None: ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.. However, if I change t1.gradient(cl_loss_i, phi_ij) -> t1.gradient(cl_loss_i, presoftmax_i), it no longer returns None. Not sure why this is the case? Is there an issue with computing gradients on sliced tensors? Is there an issue with "watching" too many variables? I haven't worked much with gradient tape so I'm not sure what the fix is, but would appreciate help.
For anyone who is interested, here are more details:
I never see you watch any tensors. Note that the tape only traces tf.Variable by default. Is this missing from your code? Else I don't see how t1.gradient(cl_loss_i, presoftmax_i) is working.
Either way, I think the easiest way to fix it is to do
all_gradients = t1.gradient(cl_loss_i, presoftmax_i)
desired_gradients = all_gradients[[:,np.argmax(pred_t)]]
so simply do the indexing after the gradient. Note that this can be wasteful (if there are many classes) as you are computing more gradients than you need.
The explanation for why (I believe) your version doesn't work would be easiest to show in a drawing, but let me try to explain: Imagine the computations in a directed graph. We have
presoftmax_i -> pred_i -> cl_loss_i
Backpropagating the loss to the presoftmax is easy. But then you set up another branch,
presoftmax_i -> presoftmax_ij
Now, when you try to compute the gradient of the loss with respect to presoftmax_ij, there is actually no backpropagation path (we can only follow arrows backwards). Another way to think about it: You compute presoftmax_ij after computing the loss. How could the loss depend on it then?
I have a TensorFlow CNN model that is performing well and we would like to implement this model in hardware; i.e., an FPGA. It's a relatively small network but it would be ideal if it were smaller. With that goal, I've examined the kernels and find that there are some where the weights are quite strong and there are others that aren't doing much at all (the kernel values are all close to zero). This occurs specifically in layer 2, corresponding to the tf.Variable() named, "W_conv2". W_conv2 has shape [3, 3, 32, 32]. I would like to freeze/lock the values of W_conv2[:, :, 29, 13] and set them to zero so that the rest of the network can be trained to compensate. Setting the values of this kernel to zero effectively removes/prunes the kernel from the hardware implementation thus achieving the goal stated above.
I have found similar questions with suggestions that generally revolve around one of two approaches;
Suggestion #1:
tf.Variable(some_initial_value, trainable = False)
Implementing this suggestion freezes the entire variable. I want to freeze just a slice, specifically W_conv2[:, :, 29, 13].
Suggestion #2:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list)
Again, implementing this suggestion does not allow the use of slices. For instance, if I try the inverse of my stated goal (optimize only a single kernel of a single variable) as follows:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list = W_conv2[:,:,0,0]))
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'strided_slice_2228:0' shape=(3, 3) dtype=float32>)
Slicing tf.Variables() isn't possible in the way that I've tried it here. The only thing that I've tried which comes close to doing what I want is using .assign() but this is extremely inefficient, cumbersome, and caveman-like as I've implemented it as follows (after the model is trained):
for _ in range(10000):
# get a new batch of data
# reset the values of W_conv2[:,:,29,13]=0 each time through
for m in range(3):
for n in range(3):
assign_op = W_conv2[m,n,29,13].assign(0)
sess.run(assign_op)
# re-train the rest of the network
_, loss_val = sess.run([optimizer, loss], feed_dict = {
dict_stuff_here
})
print(loss_val)
The model was started in Keras then moved to TensorFlow since Keras didn't seem to have a mechanism to achieve the desired results. I'm starting to think that TensorFlow doesn't allow for pruning but find this hard to believe; it just needs the correct implementation.
A possible approach is to initialize these specific weights with zeros, and modify the minimization process such that gradients won't be applied to them. It can be done by replacing the call to minimize() with something like:
W_conv2_weights = np.ones((3, 3, 32, 32))
W_conv2_weights[:, :, 29, 13] = 0
W_conv2_weights_const = tf.constant(W_conv2_weights)
optimizer = tf.train.RMSPropOptimizer(0.001)
W_conv2_orig_grads = tf.gradients(loss, W_conv2)
W_conv2_grads = tf.multiply(W_conv2_weights_const, W_conv2_orig_grads)
W_conv2_train_op = optimizer.apply_gradients(zip(W_conv2_grads, W_conv2))
rest_grads = tf.gradients(loss, rest_of_vars)
rest_train_op = optimizer.apply_gradients(zip(rest_grads, rest_of_vars))
tf.group([rest_train_op, W_conv2_train_op])
I.e,
Preparing a constant Tensor for canceling the appropriate gradients
Compute gradients only for W_conv2, then multiply element-wise with the constant W_conv2_weights to zero the appropriate gradients and only then apply gradients.
Compute and apply gradients "normally" to the rest of the variables.
Group the 2 train ops to a single training op.
It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, they are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat, what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)
# First layer of weights
with tf.name_scope("layer1"):
W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.matmul(X, W1)
layer1_act = tf.nn.tanh(layer1)
tf.summary.histogram("weights", W1)
tf.summary.histogram("layer", layer1)
tf.summary.histogram("activations", layer1_act)
# Second layer of weights
with tf.name_scope("layer2"):
W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer2 = tf.matmul(layer1_act, W2)
layer2_act = tf.nn.tanh(layer2)
tf.summary.histogram("weights", W2)
tf.summary.histogram("layer", layer2)
tf.summary.histogram("activations", layer2_act)
# Third layer of weights
with tf.name_scope("layer3"):
W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
initializer=tf.contrib.layers.xavier_initializer())
layer3 = tf.matmul(layer2_act, W3)
layer3_act = tf.nn.tanh(layer3)
tf.summary.histogram("weights", W3)
tf.summary.histogram("layer", layer3)
tf.summary.histogram("activations", layer3_act)
# Fourth layer of weights
with tf.name_scope("layer4"):
W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
initializer=tf.contrib.layers.xavier_initializer())
Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4)) # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
tf.summary.histogram("weights", W4)
tf.summary.histogram("Qpred", Qpred)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")
# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
It appears that the network hasn't learned anything in the layers one to three. The last layer does change, so that means that there either may be something wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights or the last layer really 'eats up' all error. It could also be that only biases are learned. The network appears to learn something though, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display the number of occurrences of a value relative to each other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of amount 10 on the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it's just showing that some weight values around -0.1, 0.05 and 0.25 tend to be occur with a higher probability; a reason could be, that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could actually use a smaller network or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.
Here I would indirectly explain the plot by giving a minimal example. The following code produce a simple histogram plot in tensorboard.
from datetime import datetime
import tensorflow as tf
filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
for i in range(10):
t = tf.random.uniform((2, 2), 1000)
tf.summary.histogram(
"train/hist",
t,
step=i
)
print(t)
We see that generating a 2x2 matrix with a maximum range 1000 will produce values from 0-1000. To how this tensor might look, i am putting log of a few of them here.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged into tensorboard 10 times. The to right of the plot, a timeline is generated to indicate timesteps. The depth of histogram indicate which values are new. The lighter/front values are newer and darker/far values are older.
Values are gathered into buckets which are indicated by those triangle structures. x-axis indicate the range of values where the bunch lies.
In TensorFlow, suppose I have an image, and I would like to apply the same convolutional kernel to it multiple times. For example, I'd like to do something like the following:
for i in range(5):
output = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(output, weights, [1,1,1,1]), bias))
But this runs out of memory very quickly if I have many loops. I think this is because the system is allocating enough memory to hold the entire sequence at once.
What is an alternative way of doing this?
Thanks!
The way tensorflow works is as follows:
When you create any operation, even if this operation is not used, it will not be destroyed. Therefore, through each iteration, you are creating 3 operations. So after 5 iterations, you end up with 15 operations. So in reality, this is not how to apply the same convolutional kernel to the input. So you should rather create the whole graph first. That is:
x = tf.placeholder(tf.float32, shape=[...], name='input_frames')
weights: tf.Variable(tf.truncated_normal([5, 5, 3, 128], stddev=1e-1, dtype=tf.float32, name='weights1'))
output = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(output, weights, [1,1,1,1]), bias))
for i in range(5):
sess.run(output, feed_dict={x: "your_image"})
Hope this helps.
I have been using Tensorflow with the l-bfgs optimizer from openopt. It was pretty easy to setup callbacks to allow Tensorflow to compute gradients and loss evaluations for the l-bfgs, however, I am having some trouble figuring out how to introduce stochastic elements like dropout into the training procedure.
During the line search, l-bfgs performs multiple evaluations of the loss function, which need to operate on the same network as the prior gradient evaluation. However, it seems that for each evaluation of the tf.nn.dropout function, a new set of dropouts is created. I am looking for a way to fix the dropout over multiple evaluations of the loss function, and then allow it to change between the gradient steps of the l-bfgs. I'm assuming this has something to do with the control flow ops in tensorflow, but there isn't really a good tutorial on how to use these and they are a little enigmatic to me.
Thanks for your help!
Drop-out relies on uses random_uniform which is a stateful op, and I don't see a way to reset it. However, you can hack around it by substituting your own random numbers and feeding them to the same input point as random_uniform, replacing the generated values
Taking the following code:
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed)
Visualize the graph to see where random_uniform is connected
You can see dropout takes input of random_uniform through the Add op which has a name mydropout/random_uniform/(random_uniform). Actually the /(random_uniform) suffix is there for UI reasons, and the true name is mydropout/random_uniform as you can see by printing tf.get_default_graph().as_graph_def(). That gives you shortened tensor name. Now you append :0 to get actual tensor name. (side-note: operation could produce multiple tensors which correspond to suffixes :0, :1 etc. Since having one output is the most common case, :0 is implicit in GraphDef and node input is equivalent to node:0. However :0 is not implicit when using feed_dict so you have to explicitly write node:0)
So now you can fix the seed by generating your own random numbers (of the same shape as incoming tensor), and reusing them between invocations.
tf.reset_default_graph()
a = tf.constant([1, 1, 1, 1, 1], dtype=tf.float32)
graph_level_seed = 1
operation_level_seed = 1
tf.set_random_seed(graph_level_seed)
b = tf.nn.dropout(a, 0.5, seed=operation_level_seed, name="mydropout")
random_numbers = np.random.random(a.get_shape()).astype(dtype=np.float32)
sess = tf.Session()
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
print sess.run(b, feed_dict={"mydropout/random_uniform:0":random_numbers})
You should see the same set of numbers with 2 run calls.