Is it relevant to use both feature normalizer_fn and batch normalization? - tensorflow

Is it relevant to use both a feature normalizer_fn and batch normalization, as in the following?
feature_columns_complex_standardized = [
    tf.feature_column.numeric_column("my_feature", normalizer_fn=lambda x: (x - xMean) / xStd)
]
model1 = tf.estimator.DNNClassifier(feature_columns=feature_columns_complex_standardized,
                                    hidden_units=[512, 512, 512],
                                    optimizer=tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.99, epsilon=1e-08, use_locking=False),
                                    weight_column=weights,
                                    dropout=0.5,
                                    activation_fn=tf.nn.softmax,
                                    n_classes=10,
                                    label_vocabulary=Action_vocab,
                                    model_dir='./Models9/Action/',
                                    loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE,
                                    config=tf.estimator.RunConfig().replace(save_summary_steps=10),
                                    batch_norm=True)

You may be mixing up two different things. Feature normalization (your normalizer_fn) is one of the methods used to bring the features in a dataset to the same scale, whereas batch normalization addresses internal covariate shift, where each hidden unit's input distribution changes every time the parameters of the previous layer are updated.
The two operate at different places in the network, so you can use both at the same time.
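To make the distinction concrete, here is a minimal NumPy sketch rather than TensorFlow code. The toy data, the single hidden layer, and the names x_mean, x_std and batch_norm are assumptions made for illustration. It standardizes the raw input once with fixed dataset statistics, as the normalizer_fn does, and separately normalizes the hidden activations with per-batch statistics, roughly as a batch-norm layer does in training mode (the learned scale and shift are left out).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw feature with a large offset and scale (hypothetical data).
raw = rng.normal(loc=50.0, scale=10.0, size=(256, 1))

# 1) Input standardization: fixed statistics computed once from the data set,
#    which is what normalizer_fn=lambda x: (x - xMean) / xStd applies.
x_mean, x_std = raw.mean(), raw.std()
features = (raw - x_mean) / x_std

# 2) A hidden layer: its output distribution depends on the current weights,
#    so it shifts whenever the weights are updated.
weights = rng.normal(size=(1, 4)) * 3.0
hidden = np.maximum(features @ weights + 2.0, 0.0)  # ReLU activations


def batch_norm(activations, eps=1e-5):
    """Normalize each hidden unit with the statistics of the current batch."""
    mean = activations.mean(axis=0, keepdims=True)
    var = activations.var(axis=0, keepdims=True)
    return (activations - mean) / np.sqrt(var + eps)


hidden_bn = batch_norm(hidden)
```

The input is standardized once, while the hidden activations have to be re-normalized on every batch because the weights that produce them keep changing.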

Related

Training data in two steps with same accuracy?

I am trying to implement an active learning algorithm (an experiment for a project) in which I want to train in separate steps. Please check my code below.
clf = BernoulliNB()
clf.fit(X_train[0:40], y_train[0:40])
clf.fit(X_train[40:], y_train[40:])
The above is usually done like this:
clf = BernoulliNB()
clf.fit(X_train, y_train)
The two give different accuracy scores. I want to add training data to the existing model, since training is computationally expensive and I don't want my model to repeat the computation.
Is there any way I can do this?
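You should use partial_fit to train your model in batches. Each call to fit discards what the model learned before and trains from scratch on the data it is given, which is why your two-step version scores differently.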
clf = BernoulliNB()
clf.partial_fit(X_train[0:40], y_train[0:40])
clf.partial_fit(X_train[40:], y_train[40:])
Please check this to know more about the function.
Hope this helps:)
This is called online training or incremental learning, and it is used for large data sets. Please see this page for strategies.
Essentially, in scikit-learn, you need partial_fit() with all the labels in y known in advance.
partial_fit(X, y, classes=None, sample_weight=None)
classes : array-like, shape = [n_classes] (default=None)
List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
If you simply do this:
clf.partial_fit(X_train[0:40], y_train[0:40])
clf.partial_fit(X_train[40:], y_train[40:])
Then, if a class that is not present in the first 40 samples shows up in a later call to partial_fit(), it will throw an error.
So ideally you should be doing this:
# First call
clf.partial_fit(X_train[0:40], y_train[0:40], classes = np.unique(y_train))
# subsequent calls
clf.partial_fit(X_train[40:80], y_train[40:80])
clf.partial_fit(X_train[80:], y_train[80:])
and so on..
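As a rough check of the advice above, the following sketch uses randomly generated toy data (an assumption, not the asker's data set). It compares one fit on everything, batched partial_fit calls with classes declared on the first call, and the asker's two consecutive fit calls.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(120, 8))   # toy binary features
y_train = rng.integers(0, 3, size=120)        # toy labels in {0, 1, 2}
all_classes = np.unique(y_train)

# Reference model: a single fit on all the data.
full = BernoulliNB().fit(X_train, y_train)

# Incremental model: declare every class on the first call only.
incremental = BernoulliNB()
for start in range(0, len(X_train), 40):
    batch = slice(start, start + 40)
    if start == 0:
        incremental.partial_fit(X_train[batch], y_train[batch], classes=all_classes)
    else:
        incremental.partial_fit(X_train[batch], y_train[batch])

# Calling fit() twice instead keeps only what the second call learned.
refit = BernoulliNB()
refit.fit(X_train[:40], y_train[:40])
refit.fit(X_train[40:], y_train[40:])
```

For naive Bayes the batched model should end up with the same parameters as the single fit, because the model only accumulates counts. The refit model has seen only the last 80 samples.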

questions on defining and calling train.momentumoptimizer

I have some questions regarding the following code segment
def _optimizer(self, training_iters, global_step, opt_kwargs={}):
    learning_rate = self.opt_kwargs.pop("learning_rate", 0.2)
    decay_rate = self.opt_kwargs.pop("decay_rate", 0.95)
    self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate,
                                                         global_step=global_step,
                                                         decay_steps=training_iters,
                                                         decay_rate=decay_rate,
                                                         staircase=True)
    optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node,
                                           **self.opt_kwargs).minimize(self.net.cost,
                                                                       global_step=global_step)
The input parameter opt_kwargs is set up as opt_kwargs=dict(momentum=0.2).
First, why do we need self.opt_kwargs.pop("learning_rate", 0.2) to assign learning_rate? My guess is that this injects the learning rate and decay rate information into the opt_kwargs dict, but I don't see the real usage here.
Secondly, regarding tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node, **self.opt_kwargs), it looks like **self.opt_kwargs passes the whole opt_kwargs dict into MomentumOptimizer. However, according to tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False), it only needs the momentum value. Here, we are passing both learning_rate and decay_rate included in self.opt_kwargs. Is this a correct way?
1.) pop both returns a value and removes its key from the dict. It is used here to extract learning_rate and decay_rate (falling back to the defaults when they are absent) and feed them to exponential_decay(), which accepts them as individual arguments. 2.) Because those two keys have been popped, **self.opt_kwargs only passes what is left, here momentum. A leftover key that the optimizer does not accept would raise a TypeError, so extra entries are only fine if they are popped beforehand. It's not the cleanest pattern, but it is flexible: for example, you can easily swap MomentumOptimizer for another optimizer that takes different keyword arguments.
tf.train.MomentumOptimizer.__init__(learning_rate, momentum, use_locking=False, name='Momentum', use_nesterov=False) means that you need to pass a momentum value explicitly. With self.opt_kwargs.pop, you do not need to pass "learning_rate" or "decay_rate", since they default to 0.2 and 0.95.
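The pop-then-unpack pattern can be tried in plain Python. In this sketch make_optimizer is a hypothetical stand-in that merely has the same required arguments as MomentumOptimizer, and the decay_rate=0.9 entry is added only to show a key being popped.

```python
def make_optimizer(learning_rate, momentum, use_locking=False):
    """Hypothetical stand-in with the same required arguments as MomentumOptimizer."""
    return {"learning_rate": learning_rate, "momentum": momentum,
            "use_locking": use_locking}


opt_kwargs = dict(momentum=0.2, decay_rate=0.9)

# pop() returns the value (or the default) AND removes the key from the dict.
learning_rate = opt_kwargs.pop("learning_rate", 0.2)  # key absent: default 0.2
decay_rate = opt_kwargs.pop("decay_rate", 0.95)       # key present: 0.9

# Only the entries that were not popped are left to be unpacked.
remaining = dict(opt_kwargs)
optimizer = make_optimizer(learning_rate=learning_rate, **opt_kwargs)

# If a key the function does not know were still in the dict, the call would fail.
try:
    make_optimizer(learning_rate=0.2, **{"momentum": 0.2, "decay_rate": 0.9})
    leftover_key_accepted = True
except TypeError:
    leftover_key_accepted = False
```

Note that pop mutates the dict, so calling such a method a second time on the same dict would silently fall back to the defaults.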

How to use StreamingDataFeeder as contrib.learn.Estimator.fit()'s input_fn?

I have recently started using the tensorflow.contrib.learn (skflow) library and really like it. However, I am facing an issue with Estimator: its fit function takes either
(X, Y, and batch_size) - the problem with this approach is that it provides no way to specify the number of epochs and does not allow an arbitrary source of data.
input_fn - besides setting epochs, it gives me much more flexibility on the source of training data (which in my case comes directly from a database).
Now I am aware that I could create an input_fn which reads files; however, as I am not interested in dealing with files, the following functions are not useful for me:
tf.contrib.learn.read_batch_examples
tf.contrib.learn.read_batch_features
tf.contrib.learn.read_batch_record_features
Ideally, I would like to use StreamingDataFeeder as input_fn. Any ideas how I can achieve this?
StreamingDataFeeder is used when you provide iterators as x / y to fit/predict/evaluate of Estimator.
Example:
x = (np.array([i]) for i in xrange(10**10)) # use range for python >=3.0
y = (np.array([i + 1]) for i in xrange(10**10))
lr = tf.contrib.learn.LinearRegressor(
feature_columns=[tf.contrib.layers.real_valued_column('')])
# only consumes 1000*10 values from iterators.
lr.fit(x, y, steps=1000, batch_size=10)
If you want to use input_fn for feeding data, you need to use graph operations to read and process the data. For example, you can create a C++ operation that produces your data (it can listen on a port or read from a database) and converts it into a Tensor. Mainly this is good for reading data from files, but other readers can be implemented as well.
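Since the data in the question comes from a database, the iterator route can be sketched with the standard library alone. The in-memory SQLite table, its column names, and the column_iterator helper below are all hypothetical; the point is only that rows are pulled lazily, one at a time, which is the kind of x / y iterator described above.

```python
import itertools
import sqlite3

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, feature REAL, label REAL)")
conn.executemany("INSERT INTO samples (feature, label) VALUES (?, ?)",
                 [(float(i), float(i + 1)) for i in range(100)])


def column_iterator(connection, column):
    """Yield one single-element row at a time; rows are fetched lazily."""
    if column not in ("feature", "label"):      # guard the interpolated name
        raise ValueError("unknown column: %s" % column)
    cursor = connection.execute("SELECT %s FROM samples ORDER BY id" % column)
    for (value,) in cursor:
        yield [value]   # wrap in np.array([...]) before handing it to fit()


x = column_iterator(conn, "feature")
y = column_iterator(conn, "label")

first_x = list(itertools.islice(x, 3))
first_y = list(itertools.islice(y, 3))
```

With each yielded row wrapped in a NumPy array, such generators could presumably be passed as x and y to fit in the same way as the generator expressions in the example above.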

Elementwise Sampling with map_fn Slow

Say that I want to sample a matrix with each entry sampled from a distribution defined by an entry in another matrix. I unroll my matrix and apply map_fn to each element. With a relatively small matrix (128 x 128), the following gives me several PoolAllocator warnings (GTX TITAN Black) and does not train in any reasonable amount of time.
def sample(x):
    samples = tf.map_fn(lambda z:
                        tf.random_normal([1], mean=z,
                                         stddev=tf.sqrt(z * (1 - z))),
                        tf.reshape(x, [-1]))  # apply to each element
    return tf.cond(is_training, lambda: tf.reshape(samples, shape=tf.shape(x)),
                   lambda: tf.tanh(x))
Is there a better way to apply an elementwise operation like this?
Your code will run much faster if you can use Tensor-at-a-time operations instead of elementwise operations like tf.map_fn.
Here it looks like you want to sample from a normal distribution for each element, where the parameters of the distribution are different for each value in an input tensor. Try something like this:
def sample(x):
    # one standard-normal draw per entry of x, then scale and shift element-wise
    samples = tf.random_normal(shape=tf.shape(x)) * tf.sqrt(x * (1 - x)) + x
    return samples
tf.random_normal() generates a normal distribution with mean 0.0 and standard deviation 1.0 by default. You can use point-wise tensor operations to fix up the standard deviation (by multiplying) and the mean (by adding) for each element. In fact, if you look at how tf.random_normal() is implemented, that's precisely what it does internally.
(You would probably also do better using a Python conditional to distinguish training from test time.)
If you plan to do this sort of thing a lot, you might file a feature request on github asking to generalize tf.random_normal to accept Tensors with more general shapes for mean and stddev. I see no reason why that shouldn't be supported.
Hope that helps!
See the tensorflow.contrib.distributions module, which has a Normal class with a sample method that does this for you.
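The scale-and-shift trick described above can be checked outside TensorFlow. This NumPy sketch is an illustration under assumed inputs (a constant matrix of 0.3), not the original graph code. It draws standard-normal noise for the whole matrix at once and then fixes up the standard deviation and mean element-wise.

```python
import numpy as np


def sample(x, rng):
    """Draw one normal sample per entry with mean x and stddev sqrt(x * (1 - x))."""
    noise = rng.standard_normal(size=x.shape)   # mean 0, stddev 1
    return noise * np.sqrt(x * (1.0 - x)) + x   # rescale, then shift


rng = np.random.default_rng(0)
x = np.full((200, 200), 0.3)    # hypothetical input, entries in (0, 1)
samples = sample(x, rng)
```

Because every entry is handled by a few whole-array operations, there is no per-element function call, which is where the map_fn version loses its time.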

What is the best way to implement weight constraints in TensorFlow?

Suppose we have weights
x = tf.Variable(np.random.random((5,10)))
cost = ...
And we use the GD optimizer:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
How can we implement for example non-negativity on weights?
I tried clipping them:
upds = tf.train.GradientDescentOptimizer(lr).minimize(cost)
session.run(upds)
session.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
But this slows down my training by a factor of 50.
Does anybody know a good way to implement such constraints on the weights in TensorFlow?
P.S.: in the equivalent Theano algorithm, I had
T.clip(x, 0, np.infty)
and it ran smoothly.
You can take the Lagrangian approach and simply add a penalty for features of the variable you don't want.
For example, to encourage theta to be non-negative, you could add the following to the optimizer's objective function.
added_loss = -tf.minimum(tf.reduce_min(theta), 0)
If any theta are negative, then added_loss will be positive, otherwise zero. Scaling that to a meaningful value is left as an exercise to the reader. Scaling too little will not exert enough pressure. Too much may make things unstable.
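A small NumPy sketch of that penalty term, with a hypothetical penalty_weight standing in for the scaling that the answer leaves open:

```python
import numpy as np


def nonneg_penalty(theta):
    """Return -min(min(theta), 0): zero when no entry is negative."""
    return -min(float(np.min(theta)), 0.0)


penalty_weight = 10.0   # hypothetical scale; needs tuning per problem

ok = np.array([0.0, 0.5, 2.0])
bad = np.array([0.5, -0.25, -1.5])

penalty_ok = penalty_weight * nonneg_penalty(ok)     # no negative entries
penalty_bad = penalty_weight * nonneg_penalty(bad)   # driven by the -1.5 entry
```

One property worth knowing: this form reacts only to the single most negative entry, so other negative entries receive no gradient until they become the minimum. Summing over all negative entries is a common alternative.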
As of TensorFlow 1.4, tf.get_variable has a new argument that lets you pass a constraint function, which is applied after the optimizer's update. Here is an example that enforces a non-negativity constraint:
with tf.variable_scope("MyScope"):
v1 = tf.get_variable("v1", …, constraint=lambda x: tf.clip_by_value(x, 0, np.infty))
constraint: An optional projection function to be applied to the variable after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights). The function must take as input the unprojected Tensor representing the value of the variable and return the Tensor for the projected value (which must have the same shape). Constraints are not safe to use when doing asynchronous distributed training.
By running
sess.run(tf.assign(x, tf.clip_by_value(x, 0, np.infty)))
you add a new assign node to the graph on every call, which makes it slower and slower.
Instead, define a clip_op once when building the graph and run it each time after updating the weights:
# build the graph
x = tf.Variable(np.random.random((5,10)))
loss = ...
train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
clip_op = tf.assign(x, tf.clip_by_value(x, 0, np.infty))
# train
sess.run(train_op)
sess.run(clip_op)
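The train-then-clip loop above is a form of projected gradient descent. The following NumPy sketch reproduces the idea on a toy quadratic loss. The target vector, learning rate and step count are made-up values, and the two update lines play the roles of train_op and clip_op.

```python
import numpy as np

target = np.array([1.0, -2.0, 0.5, -0.1])   # hypothetical unconstrained optimum
x = np.zeros_like(target) + 0.3             # initial weights
lr = 0.1

for _ in range(200):
    grad = 2.0 * (x - target)        # gradient of sum((x - target) ** 2)
    x = x - lr * grad                # plain gradient step (the train_op)
    x = np.clip(x, 0.0, None)        # projection onto x >= 0 (the clip_op)
```

Entries whose unconstrained optimum is negative end up pinned at zero, while the others converge to their targets, so this behaves as a hard constraint after every step.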
I recently had this problem as well. I discovered that you can import keras, which has nice weight constraint functions, and use them directly as the kernel constraint in tensorflow. Here is an example from my code. You can do similar things with the kernel regularizer.
from keras.constraints import non_neg
conv1 = tf.layers.conv2d(
    inputs=features['x'],
    filters=32,
    kernel_size=[5, 5],
    strides=2,
    padding='valid',
    activation=tf.nn.relu,
    kernel_regularizer=None,
    kernel_constraint=non_neg(),
    use_bias=False)
There is a practical solution: you can write your cost function so that it puts a high cost on negative weights. I did this in a matrix factorization model in TensorFlow with Python, and it worked well enough. Right? I mean, it's obvious, but nobody else mentioned it, so here you go. EDIT: I just saw that Mark Borderding also gave another loss- and cost-based solution before I did.
And if "the best way" is wanted, as the OP asked, what then? Well "best" might actually be application-specific, in which case you'd need to try a few different ways with your dataset and consider your application requirements.
Here is working code for increasing the cost for unwanted negative solution variables:
cost = tf.reduce_sum(keep_loss) + Lambda * reg  # Cost = sum of losses for training set, except missing data.
if prefer_nonneg:  # Optionally increase cost for negative values in rhat, if you want that.
    negs_indices = tf.where(rhat < tf.constant(0.0))
    neg_vals = tf.gather_nd(rhat, negs_indices)
    cost += 2. * tf.reduce_sum(tf.abs(neg_vals))  # 2 is a magic number (empirical parameter)
You are free to use my code but please give me some credit if you choose to use it. Give a link to this answer on stackoverflow.com please.
This design would be considered a soft constraint, because you can still get negative weights, if you let it, depending on your cost definition.
It seems that constraint= is also available in TF v1.4+ as a parameter to tf.get_variable(), where you can pass a function like tf.clip_by_value. This seems like another soft constraint, not hard constraint, in my opinion, because it depends on your function to work well or not. It also might be slow, as the other answerer tried the same function and reported it was slow to converge, although they didn't use the constraint= parameter to do this. I don't see any reason why one would be any faster than the other since they both use the same clipping approach. So if you use the constraint= parameter then you should expect slow convergence in the context of the original poster's application.
It would be nicer if TF also provided true hard constraints in the API and figured out by itself how to implement them efficiently on the back end. I have seen this done in linear programming solvers for a long time: the application declares a constraint, and the back end makes it happen.