Defining assignment function as variable in TensorFlow? - tensorflow

I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the input. In other words, the data does not have to be realistic, but the relationships between inputs and labels are specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However, I noticed that if I write the above code differently, it becomes much slower:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?

First, the explanation for your observations
Computational difference between 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it on every iteration. In the second solution, however, you create a new assign op (and a new random op) on every iteration, so the computational graph keeps growing, which slows your program down.
Observation about the 1st solution
(After @Y.Z.'s finding) Apparently the first solution does evaluate to a different random array every time you run it. Therefore, the first solution is also valid.
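To make the two observations above concrete, here is a minimal pure-Python sketch. ToyGraph and add_random_assign are hypothetical stand-ins invented for this illustration, not TensorFlow's API; the only assumption carried over is that building an op adds a node to the graph, while running it draws fresh random values.

```python
import random

random.seed(0)


class ToyGraph:
    """Hypothetical stand-in for a computational graph (not TensorFlow's API)."""

    def __init__(self):
        self.nodes = []

    def add_random_assign(self, target, size):
        # Building the op only records a node; nothing is sampled yet.
        def op():
            # Sampling happens here, each time the op is run.
            target[:] = [random.gauss(0.0, 1.0) for _ in range(size)]
            return list(target)

        self.nodes.append(op)
        return op


# Pattern 1: build the op once, run it many times.
graph_once = ToyGraph()
training_input = [0.0] * 3
assign_op = graph_once.add_random_assign(training_input, 3)
samples = [assign_op() for _ in range(5)]

# Pattern 2: build (and run) a new op on every iteration.
graph_loop = ToyGraph()
for _ in range(5):
    graph_loop.add_random_assign(training_input, 3)()

print(len(graph_once.nodes), len(graph_loop.nodes))  # 1 5
print(samples[0] != samples[1])  # each run of the single op gave new values
```

In this toy model the first pattern keeps one node yet still yields different values per run, while the second pattern adds a node per iteration.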
Another way to implement this
An alternative way to implement your solution is to use a tf.placeholder and feed in new values on every iteration, as follows.
import tensorflow as tf
import numpy as np
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values, feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
It turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of nodes remains constant regardless of the number of iterations. That is expected: in both cases the ops are created once, before the loop, and sess.run only re-executes them.

Related

Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate in TensorFlow a network built with the Lasagne library. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
1. I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
2. The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
1. renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore it if you are using default batch normalization.
2. Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over the minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, TensorFlow only executes what it needs to. Those moving averages are not needed for training, so they would normally never be executed/updated. The tf.control_dependencies context manager forces TensorFlow to run the updates every time it computes whatever is created inside the code block (in this case the training op). Since the training op is run exactly once per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.
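As a rough illustration of the two modes and of what the update ops do, here is a pure-Python sketch for a single scalar feature. ToyBatchNorm is a hypothetical class written for this answer, not part of TensorFlow; it assumes the moving averages follow moving = momentum * moving + (1 - momentum) * batch_value, and it leaves out the learned scale and offset.

```python
import math


class ToyBatchNorm:
    """Scalar batch normalization with moving statistics (illustrative only)."""

    def __init__(self, momentum=0.9, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training mode: normalize with the statistics of the current minibatch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # The "update ops": if these two lines never ran, the moving
            # averages would keep their initial values forever.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference mode: normalize with the stored moving averages.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.epsilon) for x in batch]


bn = ToyBatchNorm(momentum=0.9)
for _ in range(200):
    bn([4.0, 6.0], training=True)  # batch mean 5.0, batch variance 1.0

single = bn([5.0], training=False)  # inference works even for a single sample
print(bn.moving_mean, bn.moving_var, single)
```

After enough training steps the moving mean approaches the batch mean, which is why a single sample can be normalized sensibly at inference time.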

Added data reducing accuracy

I'm running into a situation where giving a neural network extra data reduces accuracy, and I can't see how that's possible.
Suppose you train a neural network - just a binary classifier - on a set of examples that have, say, 10 variables each. And it learns to classify both training and test sets quite accurately. Then rerun with the same examples but extra variables on each example, say an extra 20 variables each. Maybe the extra variables don't give as good a signal as the original ones, but it's still getting the original variables too. Worst-case scenario, it should just take a bit more time learning to ignore the extra variables, right? On the face of it, there shouldn't be any way for the accuracy to be less?
To go through everything I can think of:
It's the same set of records in each case.
All the original variables are still there, just with some extra ones added.
It's not about overfitting; the network trained with the extra data is much less accurate on both the training and test sets.
I don't think it's about needing more time. It's been running for a long time now and showing no signs of making progress.
I've tried with the learning rate both unchanged and reduced, same result each way.
Using TensorFlow, simple feedforward network with one hidden layer, Adam optimizer. Code is at https://github.com/russellw/tf-examples/blob/master/multilayer.py and the most important section is
# Inputs and outputs
X = tf.placeholder(dtype, shape=(None, cols))
Y = tf.placeholder(dtype, shape=None,)
# Hidden layers
n1 = 3
w1 = tf.Variable(rnd((cols, n1)), dtype=dtype)
b1 = tf.Variable(rnd(n1), dtype=dtype)
a1 = tf.nn.sigmoid(tf.matmul(X, w1) + b1)
pr('layer 1: {}', n1)
# Output layer
no = 1
wo = tf.Variable(rnd((n1, no)), dtype=dtype)
bo = tf.Variable(rnd(no), dtype=dtype)
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(a1, wo)) + bo)
tf.global_variables_initializer().run()
# Model
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.AdamOptimizer(args.learning_rate).minimize(cost)
tf.global_variables_initializer().run()
How is it possible for the extra data to make the network less accurate? What am I missing?
You are confusing variables with (training) data. Variables are what the network uses to find, say, the weights and biases that help it learn from the data. So by increasing the number of variables you are increasing the number of trainable units, which obviously requires more training time. Beyond a certain threshold, your data might become insufficient for the network to learn/update these variables.
Extra data means simply more examples.
So in your case it seems that you've crossed that threshold (assuming you've waited for your NN learning, long enough before saying it's not learning anymore).
You may want to read about a phenomenon called the curse of dimensionality. Wikipedia says:
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
https://en.wikipedia.org/wiki/Curse_of_dimensionality
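The sparsity described in that quote can be seen with a small experiment. The sketch below uses uniformly random points rather than the asker's data, so it only illustrates the geometric effect: with the number of examples held fixed, each example's nearest neighbour moves further away as the number of variables grows from 10 to 30. The function name is made up for this example.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, rng):
    """Average distance from each point to its closest other point."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


rng = random.Random(0)
distances = {}
for dims in (10, 30):
    # Same number of examples, more variables per example.
    distances[dims] = mean_nearest_neighbor_distance(200, dims, rng)
    print(dims, round(distances[dims], 2))
```

This does not by itself explain a drop in training accuracy; it only shows why the same number of examples covers a higher-dimensional input space more thinly.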
It turns out to have at least something to do with the properties of the optimizer. Though Adam worked the best of all the optimizers in one case, in a slightly different case it fails badly where Ftrl solves the problem. I don't know why Adam has this failure mode, but current solution: make optimizer a parameter, use a batch file to loop through all of them.

fit_generator in keras: where is the batch_size specified?

Hi I don't understand the keras fit_generator docs.
I hope my confusion is rational.
There is a batch_size and also the concept of training in batches. Using model.fit(), I specify a batch_size of 128.
To me this means that my dataset will be fed in 128 samples at a time, thereby greatly alleviating memory. It should allow a 100 million sample dataset to be trained as long as I have got the time to wait. After all, keras is only "working with" 128 samples at a time. Right?
But I highly suspect that specifying the batch_size alone doesn't do what I want at all. Tons of memory is still being used. For my goals I need to train in batches of 128 examples each.
So I am guessing this is what fit_generator does. I really want to ask: why doesn't batch_size actually work as its name suggests?
More importantly, if fit_generator is needed, where do I specify the batch_size? The docs say to loop indefinitely.
A generator loops over every row once. How do I loop over 128 samples at a time and remember where I last stopped and recall it the next time that keras asks for the next batch's starting row number (would be row 129 after first batch is done).
You will need to handle the batch size somehow inside the generator. Here is an example to generate random batches:
import numpy as np
data = np.arange(100)
data_lab = data%2
wholeData = np.array([data, data_lab])
wholeData = wholeData.T
def data_generator(all_data, batch_size=20):
    while True:
        idx = np.random.randint(len(all_data), size=batch_size)
        # Assuming the last column contains labels
        batch_x = all_data[idx, :-1]
        batch_y = all_data[idx, -1]
        # Return a tuple of (Xs, Ys) to feed the model
        yield (batch_x, batch_y)
# The generator never terminates, so print a single batch instead of exhausting it
print(next(data_generator(wholeData)))
First, Keras's batch_size does work as intended. If you are working on a GPU, you should know that the model itself can be very heavy with Keras, especially if you are using recurrent cells. If you are working on a CPU, the whole program is loaded in memory, so the batch size won't have much of an impact on memory use. If you are using fit(), the whole dataset is probably loaded in memory, and Keras produces the batches at every step. It's very difficult to predict the amount of memory that will be used.
As for the fit_generator() method, you should build a Python generator function (using yield instead of return) that yields one batch at every step. The yield should be inside an infinite loop (we often use while True: ...).
Do you have some code to illustrate your problem?
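For the "remember where I last stopped" part of the question, a generator's local variables already do the bookkeeping. Here is a minimal sketch using plain Python lists; sequential_batches is a hypothetical helper, and a real Keras model would need NumPy arrays (or slices read lazily from disk) instead of lists.

```python
from itertools import islice


def sequential_batches(features, labels, batch_size=128):
    """Yield consecutive (x, y) batches forever, wrapping around at the end."""
    n = len(features)
    start = 0  # local state: survives between successive next() calls
    while True:
        end = min(start + batch_size, n)
        yield features[start:end], labels[start:end]
        # Continue after the last row served, or restart once the data is exhausted.
        start = end if end < n else 0


features = list(range(300))
labels = [x % 2 for x in features]
gen = sequential_batches(features, labels, batch_size=128)
first_four = list(islice(gen, 4))
print([(x[0], len(x)) for x, _ in first_four])
# [(0, 128), (128, 128), (256, 44), (0, 128)]
```

With a generator like this, the batch size lives inside the generator, and Keras only needs to be told how many batches make up one epoch (in Keras 2 this is the steps_per_epoch argument of fit_generator, here 3).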

Tensorflow RNN sequence training

I'm making my first steps learning TF and have some trouble training RNNs.
My toy problem goes like this: a two-layer LSTM + dense layer network is fed with raw audio data and should test whether a certain frequency is present in the sound.
So the network should map, one to one, float (audio data sequence) to float (pre-chosen frequency volume).
I've got this to work on Keras and seen a similar TFLearn solution but would like to implement this on bare Tensorflow in a relatively efficient way.
what i've done:
lstm = rnn_cell.BasicLSTMCell(LSTM_SIZE,state_is_tuple=True,forget_bias=1.0)
lstm = rnn_cell.DropoutWrapper(lstm)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * 2,state_is_tuple=True)
# inx is the input tensor ("in" is a reserved word in Python and cannot be used as a name)
outputs, states = rnn.dynamic_rnn(stacked_lstm, inx, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
network= tf.matmul(last, W) + b
# cost function, optimizer etc...
during training I fed this with (BATCH_SIZE, SEQUENCE_LEN,1) batches and it seems like the loss converged correctly but I can't figure out how to predict with the trained network.
My (awful lot of) questions:
how do i make this network return a sequence right from Tensorflow without going back to python for each sample(feed a sequence and predict a sequence of the same size)?
If I do want to predict one sample at a time and iterate in python what is the correct way to do it?
During testing is dynamic_rnn needed or it's just used for unrolling for BPTT during training? why is dynamic_rnn returning all the back propagation steps Tensors? these are the outputs of each layer of the unrolled network right?
after some research:
how do i make this network return a sequence right from Tensorflow
without going back to python for each sample(feed a sequence and
predict a sequence of the same size)?
you can use state_saving_rnn
class Saver():
    def __init__(self):
        self.d = {}
    def state(self, name):
        if name not in self.d:
            return tf.zeros([1, LSTM_SIZE], tf.float32)
        return self.d[name]
    def save_state(self, name, val):
        self.d[name] = val
        return tf.identity('save_state_name')  # <- important for control_dependencies
outputs, states = rnn.state_saving_rnn(stacked_lstm, inx, Saver(),
('lstmstate', 'lstmstate2', 'lstmstate3', 'lstmstate4'),sequence_length=[EVAL_SEQ_LEN])
#4 states are for two layers of lstm each has hidden and CEC variables to restore
# one prediction per time step (outputs[i], not outputs[-1], so the entries differ)
network = [tf.matmul(outputs[i], W) for i in xrange(EVAL_SEQ_LEN)]
One problem is that state_saving_rnn uses rnn() rather than dynamic_rnn(), so it unrolls EVAL_SEQ_LEN steps at graph-construction time. You might want to re-implement state_saving_rnn on top of dynamic_rnn if you want to feed long sequences.
If I do want to predict one sample at a time and iterate in python what is the correct way to do it?
You can use dynamic_rnn and supply initial_state. This is probably just as efficient as state_saving_rnn. Look at the state_saving_rnn implementation for reference.
During testing is dynamic_rnn needed or it's just used for unrolling for BPTT during training? why is dynamic_rnn returning all the back propagation steps Tensors? these are the outputs of each layer of the unrolled network right?
dynamic_rnn does the unrolling at run time, whereas rnn() unrolls at graph-construction time. I guess it returns the outputs of all time steps so that you can branch the graph elsewhere, for example after fewer time steps. In a network that works as [one time step input * current state -> one output, new state], like the one described above, this is not needed for testing, but it could be used for training with truncated backpropagation through time.
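The idea of predicting one sample at a time by passing the state back in can be sketched without TensorFlow. toy_cell below is a made-up scalar recurrence standing in for the stacked LSTM, and the weights are arbitrary; the point is only that feeding the returned state in as the next initial state reproduces what a single call over the whole sequence computes.

```python
import math


def toy_cell(x, state, w_x=0.5, w_h=0.8):
    """One step of a minimal recurrent cell: (input, state) -> (output, new state)."""
    new_state = math.tanh(w_x * x + w_h * state)
    return new_state, new_state


def run_sequence(inputs, initial_state=0.0):
    """Run the cell over a sequence and return all outputs plus the final state."""
    state, outputs = initial_state, []
    for x in inputs:
        out, state = toy_cell(x, state)
        outputs.append(out)
    return outputs, state


sequence = [0.1, -0.3, 0.7, 0.2]

# Whole sequence in one call.
full_outputs, _ = run_sequence(sequence)

# One sample at a time, feeding the returned state back in as the initial state.
state, step_outputs = 0.0, []
for x in sequence:
    outs, state = run_sequence([x], initial_state=state)
    step_outputs.append(outs[0])

print(step_outputs == full_outputs)  # True
```

If the state were reset to zero on every call instead, only the first output would match.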

Tracking counts of examples used in training

I am trying to implement the CBOW word2vec model based on the skipgrams implementation on the tensorflow repository:
https://github.com/tensorflow/tensorflow/blob/v0.10.0/tensorflow/models/embedding/word2vec.py
I have previously implemented the simplified version following the TensorFlow tutorials, so I understand that I will have to modify the data batching function as well as a small part of the graph to get the context embedding.
In the skipgram implementation, the data batching function is used in lines 348-351.
(words, counts, words_per_epoch, self._epoch, self._words, examples,
 labels) = word2vec.skipgram(filename=opts.train_data,
                             batch_size=opts.batch_size,
                             window_size=opts.window_size,
                             min_count=opts.min_count,
                             subsample=opts.subsample)
From my understanding, the variables assigned are as follows:
words: terms in the vocabulary
counts: associated counts of terms used in the corpus
words_per_epoch: total word count in the corpus
self._epoch: current count of epochs used
self._words: current count of training examples used
examples: current batch of training examples
labels: current batch of training labels
I have managed to replicate the tensor for words, counts, words_per_epoch, examples and labels. However, self._epoch and self._words have eluded me. If my understanding is correct, I need to be able to track the count of the training examples used. However, this is not provided by the sample batching function. The counts are later used in a multi-threaded manner to terminate the training loop, hence I can't simply use a loop to add up the counts.
I understand that bits of the tensorflow ops are implemented in C++. However, as I am not familiar with C++, I will have to replicate those parts using Python.
It would be great if I could get some suggestions on how to obtain the tensor for self._words. The tensor basically has to increment every time a new batch of examples/labels is requested. With that, I can simply use self._epoch = self._words // words_per_epoch to get the other tensor.
Figured out the trick while looking at the source code for tensorflow.models.embedding.word2vec_optimized.py. Specifically, how global_step was incremented when loss was called in lines 218-225.
In my case, I would have to do it like so:
# code to prepare features and labels tensors
data_processed = tf.Variable(0, trainable=False, dtype=tf.int64)
epochs_processed = data_processed // data_per_epoch
inc_op = data_processed.assign_add(batch_size)
with tf.control_dependencies([inc_op]):
    features_batch, labels_batch = tf.train.batch([features, labels],
                                                  batch_size=batch_size)
In this case, the tensor data_processed will be incremented by batch_size whenever features_batch or labels_batch is evaluated. epochs_processed will then change accordingly.
The use of tf.control_dependencies(control_inputs) is key here. It returns a context manager. The operations specified in control_inputs must be executed before the operations defined in the context.
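For readers who think more easily in plain Python, the same "count every batch that is handed out, even from several threads" idea can be sketched as follows. This is an analogue using a lock, not the TensorFlow mechanism above, and CountingBatcher is a hypothetical class invented for this illustration.

```python
import threading


class CountingBatcher:
    """Hands out batches and counts how many examples have been served."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size
        self.data_processed = 0
        self._lock = threading.Lock()

    def next_batch(self):
        # Read the position and bump the counter under one lock, so that
        # concurrent callers never receive the same starting position.
        with self._lock:
            start = self.data_processed % len(self.data)
            self.data_processed += self.batch_size
        return [self.data[(start + i) % len(self.data)] for i in range(self.batch_size)]

    @property
    def epochs_processed(self):
        # Derived from the counter, like epochs_processed above.
        return self.data_processed // len(self.data)


batcher = CountingBatcher(data=list(range(10)), batch_size=4)
threads = [threading.Thread(target=batcher.next_batch) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(batcher.data_processed, batcher.epochs_processed)  # 20 2
```

The counter can only change as part of fetching a batch, which is the same guarantee the control dependency gives in the graph version.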