Does TensorFlow 0.10.0rc support float16?

In order to reduce the size of my tensors, I defined all the variables with dtype=tf.float16 in my model, and then defined the optimizer:
optimizer = tf.train.AdamOptimizer(self.learning_rate)
self.compute_gradients = optimizer.compute_gradients(self.mean_loss_reg)
train_adam_op = optimizer.apply_gradients(self.compute_gradients, global_step=self.global_step)
Everything works OK up to that point, but after I run train_adam_op, the gradients and variables become nan in Python. Does the apply_gradients() API support the tf.float16 type? Why do I get nan after apply_gradients() is called by session.run()?

The dynamic range of fp16 is fairly limited compared to that of 32-bit floats. As a result, it's pretty easy to overflow or underflow them, which often results in the NaN that you've encountered.
You can insert a few check_numerics operations in your model to help pinpoint the specific operation(s) that become unstable when performed in fp16.
For example, you can wrap an L2 loss operation as follows to check that its result fits in an fp16:
A = tf.nn.l2_loss(some_tensor)
becomes
A = tf.check_numerics(tf.nn.l2_loss(some_tensor), "found the root cause")
The most common sources of overflows and underflows are exp(), log(), and the various classification primitives, so I would start looking there.
Once you've figured out which sequence of operations is problematic, you can update your model to perform that sequence using 32-bit floats: use tf.cast() to convert the inputs of the sequence to 32-bit floats, and cast the result back to fp16.
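For illustration, a minimal sketch of that pattern (the placeholder and names here are hypothetical, assuming a TF 1.x graph as in the question):
import tensorflow as tf

x16 = tf.placeholder(tf.float16, shape=[None, 10])    # hypothetical fp16 activations
x32 = tf.cast(x16, tf.float32)                        # upcast before the sensitive op
loss32 = tf.nn.l2_loss(x32)                           # reduction done in 32-bit
loss16 = tf.check_numerics(tf.cast(loss32, tf.float16),
                           "l2 loss overflowed fp16") # cast back and verify it still fits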

Related

Difference between feature_column.embedding_column and keras.layers.Embedding in TensorFlow

I have been using keras.layers.Embedding for almost all of my projects. But, recently I wanted to fiddle around with tf.data and found feature_column.embedding_column.
From the documentation:
feature_column.embedding_column -
DenseColumn that converts from sparse, categorical input.
Use this when your inputs are sparse, but you want to convert them to a dense
representation (e.g., to feed to a DNN).
keras.layers.Embedding - Turns positive integers (indexes) into dense vectors of fixed size.
e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
My question is: are both APIs doing a similar thing on different types of input data (for example, input [0, 1, 2] for keras.layers.Embedding and its one-hot-encoded representation [[1,0,0],[0,1,0],[0,0,1]] for feature_column.embedding_column)?
After reviewing source code for both operations here is what I found:
both operations rely on tensorflow.python.ops.embedding_ops functionality;
keras.layers.Embedding uses dense representations and contains generic Keras code for handling shapes, initializing variables, etc.;
feature_column.embedding_column relies on sparse representations and contains functionality to cache results.
So your guess seems to be right: the two do similar things, rely on distinct input representations, and contain some extra logic that doesn't change the essence of what they do.
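To make the comparison concrete, here is a minimal sketch of both paths (assuming TF 2.x-style APIs; the feature name 'cat' is made up):
import tensorflow as tf

# keras.layers.Embedding: consumes dense integer indices directly.
ids = tf.constant([[0], [1], [2]])
keras_emb = tf.keras.layers.Embedding(input_dim=3, output_dim=2)
print(keras_emb(ids).shape)   # (3, 1, 2)

# feature_column.embedding_column: consumes a categorical (sparse-friendly) column.
cat_col = tf.feature_column.categorical_column_with_identity('cat', num_buckets=3)
emb_col = tf.feature_column.embedding_column(cat_col, dimension=2)
dense = tf.keras.layers.DenseFeatures([emb_col])
print(dense({'cat': tf.constant([0, 1, 2])}).shape)   # (3, 2)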

Machine learning: why does the cost function not need to be differentiable?

I was playing around with TensorFlow, creating a customized loss function, and this general machine learning question came to mind.
My understanding is that the optimization algorithm needs a differentiable cost function to find/approach a minimum. However, we can use functions that are non-differentiable, such as the absolute value function (there is no derivative at x = 0). A more extreme example: I defined my cost function like this:
def customLossFun(x, y):
    return tf.sign(x)
and I expected an error when running the code, but it actually worked (it didn't learn anything but it didn't crash).
Am I missing something?
You're missing the fact that the gradient of the sign function is manually defined somewhere in the TensorFlow source code.
As you can see here:
def _SignGrad(op, _):
    """Returns 0."""
    x = op.inputs[0]
    return array_ops.zeros(array_ops.shape(x), dtype=x.dtype)
the gradient of tf.sign is defined to always be zero. This is, of course, the gradient wherever the derivative exists, i.e. everywhere except at zero.
The TensorFlow authors decided not to check whether the input is zero and throw an exception in that specific case.
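You can verify this directly (a small sketch, assuming TF 1.x graph mode as in the question):
import tensorflow as tf

x = tf.constant([-2.0, 0.0, 3.0])
g = tf.gradients(tf.sign(x), x)[0]
with tf.Session() as sess:
    print(sess.run(g))   # [0. 0. 0.] -- the registered gradient is identically zero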
In order to prevent TensorFlow from throwing an error, the only real requirement is that your cost function evaluates to a number for any value of your input variables. From a purely "will it run" perspective, it doesn't know or care about the form of the function it's trying to minimize.
In order for your cost function to provide you a meaningful result when TensorFlow uses it to train a model, it additionally needs to 1) get smaller as your model does better and 2) be bounded from below (i.e. it can't go to negative infinity). It's not generally necessary for it to be smooth (e.g. abs(x) has a kink where the sign flips). Tensorflow is always able to compute gradients at any location using automatic differentiation (https://en.wikipedia.org/wiki/Automatic_differentiation, https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_computation).
Of course, those gradients are of more use if you've chosen a meaningful cost function that isn't too flat.
Ideally, the cost function should be smooth everywhere to apply gradient-based optimization methods (SGD, Momentum, Adam, etc.). But nothing's going to crash if it's not; you may just have issues converging to a local minimum.
When the function is non-differentiable at a certain point x, it's possible to get large oscillations if the neural network converges to this x. E.g., if the loss function is tf.abs(x), it's possible that the network weights are mostly positive, so that x > 0 at all times and the network never notices the kink in tf.abs. However, it's more likely that x will bounce around 0, so that the gradient is arbitrarily positive and negative. If the learning rate is not decaying, the optimization won't converge to the local minimum, but will bounce around it.
In your particular case, the gradient is zero all the time, so nothing's going to change at all.
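As a small illustration of that sign flip around the kink (TF 1.x graph mode assumed):
import tensorflow as tf

x = tf.constant([-0.1, 0.1])
g = tf.gradients(tf.abs(x), x)[0]
with tf.Session() as sess:
    print(sess.run(g))   # [-1.  1.] -- the gradient flips sign on either side of 0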
If it didn't learn anything, what have you gained? Your loss function is differentiable almost everywhere, but it is flat almost everywhere, so the minimizer can't figure out the direction towards the minimum.
If you start out with a positive value, it will most likely be stuck at a random value on the positive side even though the minima on the left side are better (have a lower value).
TensorFlow can be used to do calculations in general: it provides a mechanism to automatically find the derivative of a given expression, and it can do so across different compute platforms (CPU, GPU) and distributed over multiple GPUs and servers if needed.
But what you implement in TensorFlow does not necessarily have to be a goal function to be minimized. You could use it, for example, to draw random numbers and perform Monte Carlo integration of a given function.
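For example, a minimal sketch of such a Monte Carlo estimate (assuming TF 1.x; nothing here is being minimized):
import tensorflow as tf

# Estimate the integral of x**2 over [0, 1]; the true value is 1/3.
samples = tf.random_uniform([100000], minval=0.0, maxval=1.0)
estimate = tf.reduce_mean(tf.square(samples))
with tf.Session() as sess:
    print(sess.run(estimate))   # roughly 0.333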

Complex convolution in tensorflow

I'm trying to run a simple convolution but with complex numbers:
r = np.random.random([1,10,10,10])
i = np.random.random([1,10,10,10])
x = tf.complex(r,i)
conv_layer = tf.layers.conv2d(
    inputs=x,
    filters=10,
    kernel_size=[3, 3],
    kernel_initializer=utils.truncated_normal_complex(),
    activation=tf.nn.sigmoid)
However I get this error:
TypeError: Value passed to parameter 'input' has DataType complex128 not in list of allowed values: float16, float32
Does anyone know how to implement such a convolution in Tensorflow?
Will I need to implement a custom op, or is there some better option here?
Frustratingly, complex matrix multiplication is possible, e.g. the following runs fine:
def r():
    return np.random.random([10, 10])
A = tf.complex(r(),r())
B = tf.complex(r(),r())
C = tf.multiply(A,B)
sess.run(C)
So there's no real reason convolution shouldn't work, I would think (as convolution is essentially just matrix multiplication).
Thanks
Probably too late, but for anyone who is still interested: applying convolutions to complex-valued data is not as straightforward as with your usual data types like float32. There are studies that investigate different network structures for this purpose (for example, see this link for "Deep Complex U-Net"). There are implementations of these structures in PyTorch and TensorFlow.
All complex-valued features are split into either Cartesian (real, imaginary) or polar (modulus, angle) representations. Nobody is really trying to use a single feature that is purely complex; I would love to be proven wrong!
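To sketch the real/imaginary-split idea on the original example (the helper below is hypothetical, written against the TF 1.x tf.nn.conv2d API rather than tf.layers.conv2d):
import numpy as np
import tensorflow as tf

def complex_conv2d(x_real, x_imag, filters, kernel_size):
    # (xr + i*xi) * (wr + i*wi) = (xr*wr - xi*wi) + i*(xr*wi + xi*wr)
    in_ch = int(x_real.shape[-1])
    w_real = tf.get_variable('w_real', [kernel_size, kernel_size, in_ch, filters])
    w_imag = tf.get_variable('w_imag', [kernel_size, kernel_size, in_ch, filters])
    conv = lambda x, w: tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
    y_real = conv(x_real, w_real) - conv(x_imag, w_imag)
    y_imag = conv(x_real, w_imag) + conv(x_imag, w_real)
    return y_real, y_imag

r = np.random.random([1, 10, 10, 10]).astype(np.float32)
i = np.random.random([1, 10, 10, 10]).astype(np.float32)
y_real, y_imag = complex_conv2d(tf.constant(r), tf.constant(i), filters=10, kernel_size=3)
y = tf.complex(y_real, y_imag)   # recombine if a complex tensor is needed downstream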

xgboost vs H2o Gradient Boosting

I have a dataset with a large share of missing values (more than 40% missing). I generated a model in xgboost and in H2O gradient boosting and got a decent model in both cases. However, xgboost shows a particular variable as one of the key contributors to the model, while according to H2O gradient boosting that variable is not important. Does xgboost handle variables with missing values differently? All the configuration for both models is exactly the same.
Both missing value handling and variable importances are slightly different between the two methods. Both are treating missing values as information (i.e., they learn from them, and don't just impute with a simple constant). The variable importances are computed from the gains of their respective loss functions during tree construction. H2O uses squared error, and XGBoost uses a more complicated one based on gradient and hessian.
One thing you could check is the variance of the variable importances between different runs with different seeds, to see how stable each method is in terms of variable importances.
PS. If you have categoricals, then you're better off leaving the column as a factor for H2O, no need to do your own encoding. This can lead to a different effective count of columns vs XGBoost's dataset, so for column sampling, things will be different.
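A small sketch of that seed-variance check on the XGBoost side (the data here is made up; substitute your own X and y):
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 5)
X[np.random.rand(*X.shape) < 0.4] = np.nan   # ~40% missing, as in the question
y = np.random.rand(500)

importances = []
for seed in range(5):
    model = xgb.XGBRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    importances.append(model.feature_importances_)

importances = np.array(importances)
print(importances.mean(axis=0))   # average importance per feature
print(importances.std(axis=0))    # how stable each importance is across seeds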

Force copy of tensor when enqueuing

First, I'm not sure if the title is very good, but it was the best I could come up with given my understanding of the situation.
The background is that I'm trying to understand how queues work in tensorflow and ran into the following issue which puzzled me.
I have a variable n, which I enqueue to a tf.FIFOQueue, and then I increment the variable. This is repeated several times, and one would expect a result similar to 0, 1, 2, ... However, when emptying the queue all values are the same.
More precisely, the code is as follows:
from __future__ import print_function
import tensorflow as tf
q = tf.FIFOQueue(10, tf.float32)
n = tf.Variable(0, trainable=False, dtype=tf.float32)
inc = n.assign(n+1)
enqueue = q.enqueue(n)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
Which I expect would print:
0.0
1.0
2.0
Instead I get the following result:
3.0
3.0
3.0
It seems like I'm pushing some pointer to n to the queue, instead of the actual value, which is what I want. However, I don't really have any actual understanding of tensorflow internals, so maybe something else is going on?
I tried changing
enqueue = q.enqueue(n)
to
enqueue = q.enqueue(tf.identity(n))
since answers to How can I copy a variable in tensorflow and In TensorFlow, what is tf.identity used for? give me the impression that it might help, but it does not change the result. I also tried adding tf.control_dependencies(), but again, all values are the same when dequeueing.
Edit: The output above is from running the code on a computer with a single CPU, when trying to see if there was some difference between different versions of tensorflow, I noticed if I run the code on a computer with CPU and GPU I get the "expected" result. Indeed, if I run with CUDA_VISIBLE_DEVICES="" I get the result above, and with CUDA_VISIBLE_DEVICES="0" I get the "expected" result.
To force a non-caching read you can do
q.enqueue(tf.add(n, 0))
This is what's currently done by the batch-normalization layer to force a copy.
Semantics of how variables get read vs. referenced are in the process of getting revamped, so they are temporarily non-intuitive. In particular, I expected q.enqueue(n.read_value()) to force a non-caching read, but it doesn't fix your example on TF 0.12rc1.
Using GPU machine puts variable on GPU, while Queue is CPU only, so enqueue op forces a GPU->CPU copy.
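Applied to the code in the question, the fix is a one-line change (a sketch, assuming a CPU-only run, which is where the caching behaviour shows up):
import tensorflow as tf

q = tf.FIFOQueue(10, tf.float32)
n = tf.Variable(0, trainable=False, dtype=tf.float32)
inc = n.assign(n + 1)
enqueue = q.enqueue(tf.add(n, 0))   # tf.add forces a fresh copy of n's current value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(enqueue)
        sess.run(inc)
    print(sess.run(q.dequeue()))   # 0.0
    print(sess.run(q.dequeue()))   # 1.0
    print(sess.run(q.dequeue()))   # 2.0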
In case it helps, I've found that the other answers, although correct, do not work for all dtypes.
For example, this works fine with floats or ints but fails when n is a string tensor:
q.enqueue(tf.add(n, 0))
This one fails when the queue uses tuples with heterogeneous types (e.g., ints and floats):
q.enqueue_many([[n]])
So, if you find yourself caught in either of these situations, try this instead:
q.enqueue(tf.add(n, tf.zeros_like(n)))
Or, to enqueue a tuple t:
q.enqueue([tf.add(n, tf.zeros_like(n)) for n in t])
That works even for string tensors and heterogeneous tuple types.
Hope it helps!
--
Update: it looks like tf.bool types do not work with tf.zeros_like(). For those, an explicit cast to an integer type might be needed.
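For bool tensors, a workaround along the lines suggested above (a sketch; b stands for a hypothetical tf.bool tensor):
b_int = tf.cast(b, tf.int32)
q.enqueue(tf.cast(tf.add(b_int, tf.zeros_like(b_int)), tf.bool))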