How to speed up ExternalOptimizerInterface in Tensorflow?

I just did a benchmark and ExternalOptimizerInterface from the Tensorflow optimization package is almost twice as slow as a normal optimizer.
It makes me wonder what the point of it is. ExternalOptimizerInterface is clearly unfeasible for modern deep learning. Is there any way to speed up ExternalOptimizerInterface?
Here's a snippet of my ExternalOptimizer:
def _minimize(self, initial_val, loss_grad_func, equality_funcs,
              equality_grad_funcs, inequality_funcs, inequality_grad_funcs,
              step_callback, optimizer_kwargs, packed_bounds=None):
    self.t += 1
    current_val = initial_val
    _, grad = loss_grad_func(current_val)
    delta = -grad * self.learning_rate
    new_val = current_val + delta
    return new_val

You have not mentioned whether you are running it on a GPU or a CPU. Nevertheless, the performance difference comes from the fact that GradientDescentOptimizer uses a single optimized kernel, while your ExternalOptimizerInterface implementation is built from primitive operations, and TensorFlow cannot optimize across kernels.
The underlying kernel, ApplyGradientDescentOp, is defined here:
https://github.com/tensorflow/tensorflow/blob/4aa639c0cbb47f4707f735e0cc80f4c39506d928/tensorflow/core/kernels/training_ops.cc#L423
You can run both implementations and compare them using a profiler such as tf-prof for more details.
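Not part of the original answer, but one way to make that comparison in TF 1.x is to capture a Chrome-trace timeline for a single sess.run of each variant and inspect the per-kernel timings. The toy loss and train_op below are stand-ins for whatever you actually want to time:

import tensorflow as tf
from tensorflow.python.client import timeline

# Toy stand-in for the op you want to profile (e.g. the built-in
# GradientDescentOptimizer step vs. your ExternalOptimizerInterface update).
x = tf.get_variable("x", initializer=[1.0, 2.0])
loss = tf.reduce_sum(tf.square(x))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # Dump a Chrome trace; open it at chrome://tracing to see per-kernel timings.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())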

Related

tf.nn.rnn_cell.GRUCell was built on CPU device

I'm training a 2-layer seq2seq model, and GRU cells are used.
def create_rnn_cell():
    encoDecoCell = tf.contrib.rnn.GRUCell(emb_dim)
    encoDecoCell = tf.contrib.rnn.DropoutWrapper(
        encoDecoCell,
        input_keep_prob=1.0,
        output_keep_prob=0.7
    )
    return encoDecoCell

encoder_mutil = tf.contrib.rnn.MultiRNNCell(
    [create_rnn_cell() for _ in range(num_layers)],
)
query_encoder_emb = tf.contrib.rnn.EmbeddingWrapper(
    encoder_mutil,
    embedding_classes=vocab_size,
    embedding_size=word_embedding
)
A Timeline object is used to get the execution time of each node in the graph, and I found that most operations inside the GRU cell (including MatMul) happen on the CPU device, which makes it very slow. I installed the GPU version of tf-1.8. Any comments about this? Did I miss something here?
I guess there is something wrong with tf.variable_scope, because I'm using different buckets for the training data. This is how I reuse the variables between different buckets:
for i, bucket in enumerate(buckets):
    with tf.variable_scope(name_or_scope="RNN_encoder", reuse=True if i > 0 else None) as var_scope:
        query_output, query_state = tf.contrib.rnn.static_rnn(query_encoder_emb, inputs=self.query[:bucket[0]], dtype=tf.float32)
[execution time screenshot omitted]
I found the problem. In the source code of tf.contrib.rnn.EmbeddingWrapper, the CPU is used for the embedding lookup.
I rewrote this function and now it works on the GPU and is much faster. So be careful if you want to use tf.contrib.rnn.EmbeddingWrapper.
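For reference, here is a minimal sketch of doing the embedding lookup yourself instead of going through EmbeddingWrapper; the input placeholders and sizes are made-up assumptions, not from the original post:

import tensorflow as tf

# Hypothetical sizes standing in for vocab_size, word_embedding, emb_dim, num_layers.
vocab_size, word_embedding, emb_dim, num_layers = 10000, 128, 256, 2
encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(10)]

embedding = tf.get_variable("embedding", [vocab_size, word_embedding], dtype=tf.float32)
# No explicit device pinning here, so a GPU build is free to place the lookup on the GPU.
embedded_inputs = [tf.nn.embedding_lookup(embedding, ids) for ids in encoder_inputs]

encoder_cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.GRUCell(emb_dim) for _ in range(num_layers)])
query_output, query_state = tf.contrib.rnn.static_rnn(
    encoder_cell, embedded_inputs, dtype=tf.float32)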

How does sklearn's standard DBSCAN run so fast?

I've been messing around with alternative implementations of DBSCAN for clustering radar data (like grid-based DBSCAN). Up to this point, I had been using sklearn's standard euclidean DBSCAN and it would run on 26,000 data points in less than a second. However, when I specify my own distance metric, like this:
X = np.column_stack((beam, gate, time_index))
num_pts = X.shape[0]
epsilons = np.array([[beam_eps]*num_pts, [gate_eps]*num_pts, [time_eps]*num_pts]).T

metric = lambda x, y, eps: np.sqrt(np.sum((x/eps - y/eps)**2))

def dist_metric(x, y, eps):
    return np.sqrt(np.sum((x/eps - y/eps)**2))

db = DBSCAN(eps=eps, min_samples=minPts, metric=dist_metric, metric_params={'eps': epsilons}).fit(X)
it goes from 0.36 seconds to 92 minutes to run on the same data.
What I did in that code snippet can also be accomplished with just transforming the data beforehand and running standard Euclidean DBSCAN, but I'm trying to implement a reasonably fast version of Grid-based DBSCAN, for which the horizontal epsilon varies based on distance from the radar, so I won't be able to do that.
Part of the slowness in the above distance metric comes, I think, from that division by epsilon, because it only takes about a minute to run if I use a 'custom metric' that's just plain Euclidean distance:
metric = lambda x, y: np.sqrt(np.sum((x - y)**2))
How does sklearn's euclidean DBSCAN manage to run so much faster? I've been digging through the code, but haven't made sense of it so far.
Because it uses an index.
Furthermore, it avoids the slow and memory intensive Python interpreter, but does all the work in native code (compiled from Cython). This makes a huge difference when dealing with lots of primitive data such as doubles and ints that the Python interpreter would need to box.
Indexes make all the difference for similarity search. They can reduce the runtime from O(n²) to O(n log n).
But while the ball tree index allows custom metrics, the cost of invoking the Python interpreter for every distance computation is very high. So if you really want a custom metric, edit the Cython source code and compile sklearn yourself. Or you can use ELKI, because the Java JVM can compile extension code into native code when necessary; it does not need to fall back to slow interpreter callbacks the way sklearn does.
In your case, it will likely be much better to preprocess the data instead: scale it prior to clustering.
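A minimal sketch of that preprocessing idea, assuming constant per-axis epsilons (beam_eps, gate_eps, time_eps) as in the question; the data arrays below are random placeholders:

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder radar data and parameters standing in for the arrays in the question.
beam, gate, time_index = np.random.rand(3, 26000)
beam_eps, gate_eps, time_eps, minPts = 1.0, 2.0, 3.0, 5

X = np.column_stack((beam, gate, time_index))
# Dividing each axis by its epsilon turns the anisotropic metric into plain
# Euclidean distance, so sklearn's fast indexed code path applies.
X_scaled = X / np.array([beam_eps, gate_eps, time_eps])

# eps=1.0 in the scaled space corresponds to the per-axis epsilons above.
labels = DBSCAN(eps=1.0, min_samples=minPts).fit(X_scaled).labels_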

Tensorflow: Change placeholder during ScipyOptimizer

I am using the ScipyOptimizerInterface in tensorflow. I provide a minimal example below, where I optimize function f(x)=p*x**2+x for some placeholder p.
Now, I would like to gradually change the value of the placeholder during optimization, i.e. I want to change p in every step of the optimizer. Because I am using ScipyOptimizerInterface, however, I only get the final result of the optimization, not the intermediate results after each step.
Question: How can I gradually change p over time? Of course, I still want the optimization to run efficiently.
Motivation
In my actual use case, I want the final result of my optimization to satisfy some non-linear constraints. To ensure this, I introduce a penalty for violations of those constraints, and weight this penalty with p. By increasing p over time, I allow some violations initially, but ensure that in the end, the constraints are satisfied.
Minimal example
import tensorflow as tf
from tensorflow.contrib.opt import ScipyOptimizerInterface

# setup variables
x = tf.get_variable("x", initializer=[1.0])
p = tf.placeholder(dtype=tf.float32)
val = 10.0

# setup optimization
f = p * x**2 + x
optimizer = ScipyOptimizerInterface(f, options={'maxiter': 100})

# run
with tf.Session() as session:
    init = tf.global_variables_initializer()
    session.run(init, feed_dict={p: val})
    optimizer.minimize(session, feed_dict={p: val})
    ret = session.run(x)
    print(ret)
If it matters: My tensorflow version is 1.4.1.
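Not an answer from the original thread, but one pattern that fits the snippet above is to call minimize() several times with a schedule of p values, capping the inner iterations so each call performs only a partial optimization; the schedule below is purely illustrative:

import numpy as np
import tensorflow as tf
from tensorflow.contrib.opt import ScipyOptimizerInterface

x = tf.get_variable("x", initializer=[1.0])
p = tf.placeholder(dtype=tf.float32)
f = p * x**2 + x

# Few inner iterations per call, so p can be increased between calls.
optimizer = ScipyOptimizerInterface(f, options={'maxiter': 10})

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for val in np.linspace(1.0, 10.0, num=10):  # hypothetical schedule for p
        optimizer.minimize(session, feed_dict={p: val})
    print(session.run(x))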

Is there any good reason to transpose a tensor from NHWC to NCHW?

I often see the transpose implementation in tensorflow code. I wonder why one would want to transpose an NHWC tensor to NCHW. Please give me a good example and the reason behind it.
Rather than citing the documentation, you should read up on how CUDA works and think about how you would implement most operations.
The reason NCHW is generally faster than NHWC is how the CUDA kernels are written. In CUDA you need to specify what each thread is doing, like
const int threads = 32;
dim3 block(threads, threads);
dim3 grid(up2(W / 2, threads), up2(H, threads), B);
kernel<Dtype><<<grid, block>>>(args ...)
Here you get three indices, threadIdx.z, threadIdx.y, and threadIdx.x, and these threads are organized in warps (by hardware design).
And you want coalesced memory transactions, which means the threads should be ordered in such a way that the GPU can access memory quickly.
To sum it up:
You want "threadIdx.x" to be the innermost loop, and you should organize the data layout such that it is read in a coalesced way. The ideal data structure should be accessible by
b * C * H * W + c * H * W + h * W + w
where the lower-case letters denote the index and the capitalized letters denote the shape (e.g., 0 <= w < W).
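As a quick illustration (not from the original answer), this flat offset formula is exactly how a contiguous NCHW array is laid out in memory, which a few lines of NumPy can confirm:

import numpy as np

B, C, H, W = 2, 3, 4, 5
x = np.arange(B * C * H * W).reshape(B, C, H, W)  # contiguous NCHW layout

b, c, h, w = 1, 2, 3, 4
flat_offset = b * C * H * W + c * H * W + h * W + w
assert x.flat[flat_offset] == x[b, c, h, w]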
In convolution operations (part of the most-used layer), what you are essentially doing is cropping a region in each channel and computing a dot product with a region in another channel (from another tensor). So the indices which need to run crazy fast are the height index and the width index. In the end, you add along the channel axis (as the convolution formula suggests). This also explains why it makes no difference to consider NWHC or NCWH.
This has an impact on how you order the data, and it is the reason you want the memory layout I described above.
The worst layout would be:
H, C, B in threadIdx.z, threadIdx.y, threadIdx.x
The best layout would be:
B, C, H in threadIdx.z, threadIdx.y, threadIdx.x
The same is (mostly) true for GEMM as well (here, one matrix should be transposed). There is no source available for cuDNN, but you might be interested in looking into cutlass.
From the performance guide of Tensorflow:
NHWC is the TensorFlow default and NCHW is the optimal format to use
when training on NVIDIA GPUs using cuDNN. [...] The brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. In the long term, we are working on tools to auto rewrite graphs to make switching between the formats transparent and take advantages of micro optimizations where a GPU Op may be faster using NHWC than the normally most efficient NCHW.
Essentially, cuDNN is optimized for NCHW, while CPU-only tensorflow is optimized for NHWC. Switching from one to the other is just a matter of performance maximization and/or unavailability of certain operations in a specific data format.
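For concreteness, here is a small sketch (the shapes and names are made up, not taken from the answers) of transposing an NHWC tensor to NCHW and telling the convolution about the layout so cuDNN gets its preferred format:

import tensorflow as tf

# Hypothetical NHWC input batch and a 3x3 convolution kernel.
images_nhwc = tf.placeholder(tf.float32, [None, 224, 224, 3])  # N, H, W, C
kernel = tf.get_variable("kernel", shape=[3, 3, 3, 64])        # kH, kW, inC, outC

# Permute the dimensions: NHWC -> NCHW.
images_nchw = tf.transpose(images_nhwc, [0, 3, 1, 2])

# data_format tells the op (and cuDNN underneath) how to interpret the tensor.
conv = tf.nn.conv2d(images_nchw, kernel,
                    strides=[1, 1, 1, 1],
                    padding="SAME",
                    data_format="NCHW")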

What is the best way to implement weight constraints in TensorFlow?

Suppose we have weights
x = tf.Variable(np.random.random((5,10)))
cost = ...
And we use the GD optimizer:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
How can we implement for example non-negativity on weights?
I tried clipping them:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
session.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
But this slows down my training by a factor of 50.
Does anybody know a good way to implement such constraints on the weights in TensorFlow?
P.S.: in the equivalent Theano algorithm, I had
T.clip(x, 0, np.infty)
and it ran smoothly.
You can take the Lagrangian approach and simply add a penalty for features of the variable you don't want.
e.g. To encourage theta to be non-negative, you could add the following to the optimizer's objective function.
added_loss = -tf.minimum( tf.reduce_min(theta),0)
If any element of theta is negative, then added_loss will be positive; otherwise it is zero. Scaling that to a meaningful value is left as an exercise to the reader. Scaling too little will not exert enough pressure; too much may make things unstable.
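A quick sketch (the stand-in cost and the 10.0 weight are my own assumptions) of wiring that penalty into the training objective:

import numpy as np
import tensorflow as tf

theta = tf.Variable(np.random.randn(5, 10), dtype=tf.float32)
cost = tf.reduce_sum(tf.square(theta - 1.0))  # stand-in for the real cost

# The penalty is positive whenever any element of theta is negative.
added_loss = -tf.minimum(tf.reduce_min(theta), 0)
penalty_weight = 10.0                         # tune for your problem
total_cost = cost + penalty_weight * added_loss

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_cost)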
As of TensorFlow 1.4, there is a new argument to tf.get_variable that allows you to pass a constraint function that is applied after the update of the optimizer. Here is an example that enforces a non-negativity constraint:
with tf.variable_scope("MyScope"):
    v1 = tf.get_variable("v1", …, constraint=lambda x: tf.clip_by_value(x, 0, np.infty))
constraint: An optional projection function to be applied to the variable after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights). The function must take as input the unprojected Tensor representing the value of the variable and return the Tensor for the projected value (which must have the same shape). Constraints are not safe to use when doing asynchronous distributed training.
By running
sess.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
you are constantly adding nodes to the graph and making it slower and slower.
Actually you may just define a clip_op when building the graph and run it each time after updating the weights:
# build the graph
x = tf.Variable(np.random.random((5,10)))
loss = ...
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
clip_op = tf.assign(x, tf.clip_by_value(x, 0, np.infty))
# train
sess.run(train_op)
sess.run(clip_op)
I recently had this problem as well. I discovered that you can import keras, which has nice weight constraint functions, and use them directly via the kernel_constraint argument in tensorflow. Here is an example from my code. You can do similar things with the kernel regularizer.
from keras.constraints import non_neg

conv1 = tf.layers.conv2d(
    inputs=features['x'],
    filters=32,
    kernel_size=[5, 5],
    strides=2,
    padding='valid',
    activation=tf.nn.relu,
    kernel_regularizer=None,
    kernel_constraint=non_neg(),
    use_bias=False)
There is a practical solution: you can write your cost function yourself to put a high cost onto negative weights. I did this in a matrix factorization model in TensorFlow with Python, and it worked well enough. Right? I mean it's obvious. But nobody else mentioned it, so here you go. EDIT: I just saw that Mark Borderding also gave another loss- and cost-based solution implementation before I did.
And if "the best way" is wanted, as the OP asked, what then? Well, "best" might actually be application-specific, in which case you'd need to try a few different ways with your dataset and consider your application requirements.
Here is working code for increasing the cost for unwanted negative solution variables:
cost = tf.reduce_sum(keep_loss) + Lambda * reg  # Cost = sum of losses for training set, except missing data.
if prefer_nonneg:  # Optionally increase cost for negative values in rhat, if you want that.
    negs_indices = tf.where(rhat < tf.constant(0.0))
    neg_vals = tf.gather_nd(rhat, negs_indices)
    cost += 2. * tf.reduce_sum(tf.abs(neg_vals))  # 2 is a magic number (empirical parameter)
You are free to use my code but please give me some credit if you choose to use it. Give a link to this answer on stackoverflow.com please.
This design would be considered a soft constraint, because you can still get negative weights, if you let it, depending on your cost definition.
It seems that constraint= is also available in TF v1.4+ as a parameter to tf.get_variable(), where you can pass a function like tf.clip_by_value. This seems like another soft constraint, not a hard constraint, in my opinion, because it depends on your function working well or not. It also might be slow, as another answerer tried the same clipping function and reported that it slowed down training, although they didn't use the constraint= parameter to do it. I don't see any reason why one would be faster than the other, since they both use the same clipping approach. So if you use the constraint= parameter, you should expect slow training in the context of the original poster's application.
It would be nicer if TF also provided true hard constraints in the API, and let TF figure out how to implement them and make them efficient on the back end. I mean, I have seen this done in linear programming solvers for a long time already. The application declares a constraint, and the back end makes it happen.