Using multiple input pipelines in TensorFlow - tensorflow

I know how to use an input pipeline to read data from files:
input = ... # Read from file
loss = network(input) # build a network
train_op = ... # Using SGD or other algorithms to train the network.
But how can I switch between multiple input pipelines? Say, if I want to train a network for 1000 batches on the training set from the training pipeline, then validate it on a validation set from another pipeline, then keep training, then validate, then train, ..., and so forth.
It's easy to implement this with feed_dict. I also know how to use checkpoints to achieve this, just like in the cifar-10 example. But it's kind of cumbersome: I need to dump the model to disk then read it from disk again.
Can I just switch between two input pipelines (one for training data, one for validation data) to achieve this? Reading 1000 batches from the training data queue, then a few batched from the validation data queue, and so forth. If it is possible, how to do it?

Not sure if this is exactly what you are looking for, but I am doing training and validation in the same code in two separate loops. My code reads numeric and string data from .CSV files, not images. I am reading from two separate CSV files, one for training and one for validation. I'm sure you can generalize it to read from two 'sets' of files, rather than just single files, as the code is there.
Here are the code snippets in case it helps. Note that this code first reads everything as string and then converts the necessary cells into floats, just given my own requirements. If your data is purely numeric, you should just set the defaults to floats and all should be easier. Also, there are a couple of lines in there that drop Weights and Biases into a CSV file AND serialize them into the TF checkpoint file, depending on which way you'd prefer.
#first define the defaults:
rDefaults = [['a'] for row in range((TD+TS+TL))]
# this function reads line-by-line from CSV and separates cells into chunks:
def read_from_csv(filename_queue):
reader = tf.TextLineReader(skip_header_lines=False)
_, csv_row = reader.read(filename_queue)
data = tf.decode_csv(csv_row, record_defaults=rDefaults)
dateLbl = tf.slice(data, [0], [TD])
features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32)
label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32)
return dateLbl, features, label
#this function loads the above lines and spits them out as batches of N:
def input_pipeline(fName, batch_size, num_epochs=None):
filename_queue = tf.train.string_input_producer(
[fName],
num_epochs=num_epochs,
shuffle=True)
dateLbl, features, label = read_from_csv(filename_queue)
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size # max of how much to load into memory
dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
[dateLbl, features, label],
batch_size=batch_size,
capacity=capacity,
min_after_dequeue=min_after_dequeue)
return dateLbl_batch, feature_batch, label_batch
# These are the TRAINING features, labels, and meta-data to be loaded from the train file:
dateLbl, features, labels = input_pipeline(fileNameTrain, batch_size, try_epochs)
# These are the TESTING features, labels, and meta-data to be loaded from the test file:
dateLblTest, featuresTest, labelsTest = input_pipeline(fileNameTest, batch_size, 1) # 1 epoch here regardless of training
# then you define the model, start the session, blah blah
# fire up the queue:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
#This is the TRAINING loop:
try:
while not coord.should_stop():
dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
_, acc, summary = sess.run([train_step, accuracyTrain, merged_summary_op], feed_dict={x: feature_batch, y_: label_batch,
keep_prob: dropout,
learning_rate: lRate})
except tf.errors.OutOfRangeError: # (so done reading the file(s))
# by the way, this dumps weights and biases into a CSV file, since you asked for that
np.savetxt(fPath + fIndex + '_weights.csv', sess.run(W),
# and this serializes weight and biases into the TF-formatted protobuf:
# tf.train.Saver({'varW': W, 'varB': b}).save(sess, fileNameCheck)
finally:
coord.request_stop()
# now re-start the runners for the testing file:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
try:
while not coord.should_stop():
# so now this line reads features, labels, and meta-data, but this time from the training file:
dateLbl_batch, feature_batch, label_batch = sess.run([dateLblTest, featuresTest, labelsTest])
guessY = tf.argmax(y, 1).eval({x: feature_batch, keep_prob: 1})
trueY = tf.argmax(label_batch, 1).eval()
accuracy = round(tf.reduce_mean(tf.cast(tf.equal(guessY, trueY), tf.float32)).eval(), 2)
except tf.errors.OutOfRangeError:
acCumTest /= i
finally:
coord.request_stop()
coord.join(threads)
This may differ from what you are trying to do in the sense that it first completes the Training loop and THEN restarts the queues for the Testing loop. Not sure how you'd do this if you want to go back and fourth, but you can try to experiment with the two functions defined above by passing them the relevant file names (or lists) interchangeably.
Also I'm not sure if re-starting the queues after training is the best way to go, but it works for me. Would love to see a better example out there, as most TF examples use some built-in wrappers around the MNIST dataset to do the training in one go...

Related

how to feed data in batches TensorFlow CNN?

Almost all examples on github or other blogs uses mnist dataset for demo. When I am trying to use same deep NN for my images data I encounter following problem.
They use:
batch_x, batch_y = mnist.train.next_batch(batch_size)
# Run optimization op (backprop)
sess.run(train_op, feed_dict={X: trainimg, Y: trainlabel, keep_prob: 0.8})
next_batch method to feed data in batches.
My question is:
Do we have any similar method to feed data in batches?
You should have a look at tf.contrib.data.Dataset. You can create an input pipeline: define the source, apply a transforation, and batch it. See the programmer's guide for importing data.
From the documentation:
The Dataset API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training
EDIT:
I guess what you have is an array of pictures (filenames). Here is an example from the programmer's guide.
Depending on your input files, the transformation part will change. Here is the extract for consuming an array of picture files.
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_image(image_string)
image_resized = tf.image.resize_images(image_decoded, [28, 28])
return image_resized, label
# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])
# labels[i] is the label for the image in filenames[i].
labels = tf.constant([0, 37, ...])
dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
# Now you have a dataset of (image, label). Basically kind of a list with
# all your pictures encoded along with a label.
# Batch it.
dataset = dataset.batch(32)
# Create an iterator.
iterator = dataset.make_one_shot_iterator()
# Retrieve the next element.
image_batch, label_batch = iterator.get_next()
You could also shuffle your images.
Now you can use your image_batch and label_batch as placeholders in your model definition.

tensorflow checkpoint with input pipeline

We have the following input pipeline:
with tf.name_scope('input'):
filename_queue = tf.train.string_input_producer(
[filename], num_epochs=num_epochs)
# Even when reading in multiple threads, share the filename
# queue.
image, label = read_and_decode(filename_queue)
# Shuffle the examples and collect them into batch_size batches.
# (Internally uses a RandomShuffleQueue.)
# We run this in two threads to avoid being a bottleneck.
images, sparse_labels = tf.train.shuffle_batch(
[image, label], batch_size=batch_size, num_threads=2,
capacity=1000 + 3 * batch_size,
# Ensures a minimum amount of shuffling of examples.
min_after_dequeue=1000)
return images, sparse_labels
and we have the following training:
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
step = 0
while not coord.should_stop():
start_time = time.time()
# Run one step of the model. The return values are
# the activations from the `train_op` (which is
# discarded) and the `loss` op. To inspect the values
# of your ops or variables, you may include them in
# the list passed to sess.run() and the value tensors
# will be returned in the tuple from the call.
_, loss_value = sess.run([train_op, loss])
duration = time.time() - start_time
# Print an overview fairly often.
if step % 100 == 0:
print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value,
duration))
step += 1
except tf.errors.OutOfRangeError:
print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
finally:
# When done, ask the threads to stop.
coord.request_stop()
# Wait for threads to finish.
coord.join(threads)
sess.close()
I have two doubts:
1) Is the variable num_epochs deciding the number of training iterations?
2) My model is pretty large and i want to checkpoint and restore and train.
How do I know for a restored model how many iterations are done and how many are left?
1) as stated in the tensorflow api tf.train.string_input_producer will throw a tf.errors.OutOfRangeError once each string has been produced for num_epoch times. So yes, num_epochs will be deciding the number of training iterations in your code.
2) I think it might be possible to declare a tf.Variable and increase its value for each epoch that you run, so when you restore your model you could read that value again and train for the remaining epochs. Unfortunately i dont know if there is a smarter way, since most people only save their models for predictions after the training, or do finetuning for a fix number of epochs.
Hope i could help

rationale behind the evaluation in tensorflow's tutorial code cifar10_eval.py

In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
with tf.Graph().as_default() as g:
# Get images and labels for CIFAR-10.
eval_data = FLAGS.eval_data == 'test'
images, labels = cifar10.inputs(eval_data=eval_data)
# Build a Graph that computes the logits predictions from the
# inference model.
logits = cifar10.inference(images)
# Calculate predictions.
top_k_op = tf.nn.in_top_k(logits, labels, 1)
# Restore the moving average version of the learned variables for eval.
variable_averages = tf.train.ExponentialMovingAverage(
cifar10.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
# Build the summary operation based on the TF collection of Summaries.
summary_op = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)
while True:
eval_once(saver, summary_writer, top_k_op, summary_op)
if FLAGS.run_once:
break
time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints out 'precision' in the console every other eval_interval_secs, my questions are:
each time eval_once() is executed, one batch of samples (128) are dequeued from the data queue, but why I didn't see the evaluation stop after enough batches, 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
Are batches from the first 79 sampling mutually exclusive? I'd assume so but want to double-check this.
If each batch is indeed dequeued from the data queue, what are the samples after 79 times of sampling? some random sampling from the entire duplicate data queue again?
since in_top_k() is taking in some unnormalized logit values and output a string of booleans, this masks the internal conversions of softmax() + thresholding. Is there a TF op for such explicit computations? Ideally, it'd be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in "inputs" def of cifar10_input.py
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer :
string_input_producer(
string_tensor,
num_epochs=None,
shuffle=True,
seed=None,
capacity=32,
shared_name=None,
name=None,
cancel_op=None
)
num_epochs : produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
In our case, num_epochs is not specified. That's why it does not stop after few batches. It can run unlimited times.
By default, shuffle option is set to True in tf.train.string_input_producer. So, it shuffles the data first and copies that shuffled 10K filenames again and again.
Therefore, it's mutually exclusive. You can print filenames to see this.
As explained in 1, they are repeated samples. (not any random data)
You could avoid using tf.nn.in_top_k. Use tf.nn.softmax and tf.greater_equal to obtain boolean tensor that has softmax value above the specific threshold.
I hope this helps. Please comment if there is any misunderstanding.

tf.train.string_input_producer behavior in a loop

The following snippet has been taken from the TensorFlow 0.12 API documentation
def input_pipeline(filenames, batch_size, num_epochs=None):
filename_queue = tf.train.string_input_producer(
filenames, num_epochs=num_epochs, shuffle=True)
example, label = read_my_file_format(filename_queue)
# min_after_dequeue defines how big a buffer we will randomly sample
# from -- bigger means better shuffling but slower start up and more
# memory used.
# capacity must be larger than min_after_dequeue and the amount larger
# determines the maximum we will prefetch. Recommendation:
# min_after_dequeue + (num_threads + a small safety margin) * batch_size
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size
example_batch, label_batch = tf.train.shuffle_batch(
[example, label], batch_size=batch_size, capacity=capacity,
min_after_dequeue=min_after_dequeue)
return example_batch, label_batch
The question I have might be very basic for a regular TensorFlow user, but I am an absolute beginner. The question is the following :
tf.train.string_input_producer creates a queue for holding the filenames. As the input_pipeline() is called over and over again during training, how will it be ensured that everytime the same queue is used ? I guess, it is important since, if different calls to input_pipeline() result in a creation of a new queue, there does not seem to be a way to ensure that different images are picked everytime and epoch counter and shuffling can be properly maintained.
The input_pipeline function only creates the part of a (usually larger) graph that is responsible for producing batches of data. If you were to call input_pipeline twice - for whatever reason - you would be creating two different queues indeed.
In general, the function tf.train.string_input_producer actually creates a queue node (or operation) in the currently active graph (which is the default graph unless you specify something different). read_my_file_format then reads from that queue and sequentially produces single "example" tensors, while tf.train.shuffle_batch then batches these into bundles of length batch_size each.
However, the output of tf.train.shuffle_batch, two Tensors here that are returned from the input_pipeline function, only really takes on a (new) value when it is evaluated under a session. If you evaluate these tensors multiple times, they will contain different values - taken, through read_my_file_format, from files listed in the input queue.
Think of it like so:
X_batch, Y_batch = input_pipeline(..., batch_size=100)
with tf.Session() as sess:
sess.run(tf.global_variable_initializer())
tf.train.start_queue_runners()
# get the first 100 examples and labels
X1, Y1 = sess.run((X_batch, Y_batch))
# get the next 100 examples and labels
X2, Y2 = sess.run((X_batch, Y_batch))
# etc.
The boilerplate code to get it running is a bit more complex, e.g. because queues need to actually be started and stopped in the graph, because they will throw a tf.errors.OutOfRangeError when they run dry, etc.
A more complete example could look like this:
with tf.Graph().as_default() as graph:
X_batch, Y_batch = input_pipeline(..., batch_size=100)
prediction = inference(X_batch)
optimizer, loss = optimize(prediction, Y_batch)
coord = tf.train.Coordinator()
with tf.Session(graph=graph) as sess:
init = tf.group(tf.local_variable_initializer(),
tf.global_variable_initializer())
sess.run(init)
# start the queue runners
threads = tf.train.start_queue_runners(coord=coord)
try:
while not coord.should_stop():
# now you're really indirectly querying the
# queue; each iteration will see a new batch of
# at most 100 values.
_, loss = sess.run((optimizer, loss))
# you might also want to do something with
# the network's output - again, this would
# use a fresh batch of inputs
some_predicted_values = sess.run(prediction)
except tf.errors.OutOfRangeError:
print('Training stopped, input queue is empty.')
finally:
coord.request_stop()
# stop the queue(s)
coord.request_stop()
coord.join(threads)
For a deeper understanding, you might want to look at the Reading data documentation.

load multiple models in Tensorflow

I have written the following convolutional neural network (CNN) class in Tensorflow [I have tried to omit some lines of code for clarity.]
class CNN:
def __init__(self,
num_filters=16, # initial number of convolution filters
num_layers=5, # number of convolution layers
num_input=2, # number of channels in input
num_output=5, # number of channels in output
learning_rate=1e-4, # learning rate for the optimizer
display_step = 5000, # displays training results every display_step epochs
num_epoch = 10000, # number of epochs for training
batch_size= 64, # batch size for mini-batch processing
restore_file=None, # restore file (default: None)
):
# define placeholders
self.image = tf.placeholder(tf.float32, shape = (None, None, None,self.num_input))
self.groundtruth = tf.placeholder(tf.float32, shape = (None, None, None,self.num_output))
# builds CNN and compute prediction
self.pred = self._build()
# I have already created a tensorflow session and saver objects
self.sess = tf.Session()
self.saver = tf.train.Saver()
# also, I have defined the loss function and optimizer as
self.loss = self._loss_function()
self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
if restore_file is not None:
print("model exists...loading from the model")
self.saver.restore(self.sess,restore_file)
else:
print("model does not exist...initializing")
self.sess.run(tf.initialize_all_variables())
def _build(self):
#builds CNN
def _loss_function(self):
# computes loss
#
def train(self, train_x, train_y, val_x, val_y):
# uses mini batch to minimize the loss
self.sess.run(self.optimizer, feed_dict = {self.image:sample, self.groundtruth:gt})
# I save the session after n=10 epochs as:
if epoch%n==0:
self.saver.save(sess,'snapshot',global_step = epoch)
# finally my predict function is
def predict(self, X):
return self.sess.run(self.pred, feed_dict={self.image:X})
I have trained two CNNs for two separate tasks independently. Each took around 1 day. Say, model1 and model2 are saved as 'snapshot-model1-10000' and 'snapshot-model2-10000' (with their corresponding meta files) respectively. I can test each model and compute its performance separately.
Now, I want to load these two models in a single script. I would naturally try to do as below:
cnn1 = CNN(..., restore_file='snapshot-model1-10000',..........)
cnn2 = CNN(..., restore_file='snapshot-model2-10000',..........)
I encounter the error [The error message is long. I just copied/pasted a snippet of it.]
NotFoundError: Tensor name "Variable_26/Adam_1" not found in checkpoint files /home/amitkrkc/codes/A549_models/snapshot-hela-95000
[[Node: save_1/restore_slice_85 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/restore_slice_85/tensor_name, save_1/restore_slice_85/shape_and_slice)]]
Is there a way to load from these two files two separate CNNs? Any suggestion/comment/feedback is welcome.
Thank you,
Yes there is. Use separate graphs.
g1 = tf.Graph()
g2 = tf.Graph()
with g1.as_default():
cnn1 = CNN(..., restore_file='snapshot-model1-10000',..........)
with g2.as_default():
cnn2 = CNN(..., restore_file='snapshot-model2-10000',..........)
EDIT:
If you want them into same graph. You'll have to rename some variables. One idea is have each CNN in separate scope and let saver handle variables in that scope e.g.:
saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES), scope='model1')
and in cnn wrap all your construction in scope:
with tf.variable_scope('model1'):
...
EDIT2:
Other idea is renaming variables which saver manages (since I assume you want to use your saved checkpoints without retraining everything. Saving allows different variable names in graph and in checkpoint, have a look at documentation for initialization.
This should be a comment to the most up-voted answer. But I do not have enough reputation to do that.
Anyway.
If you(anyone searched and got to this point) still having trouble with the solution provided by lpp AND you are using Keras, check following quote from github.
This is because the keras share a global session if no default tf session provided
When the model1 created, it is on graph1
When the model1 loads weight, the weight is on a keras global session which is associated with graph1
When the model2 created, it is on graph2
When the model2 loads weight, the global session does not know the graph2
A solution below may help,
graph1 = Graph()
with graph1.as_default():
session1 = Session()
with session1.as_default():
with open('model1_arch.json') as arch_file:
model1 = model_from_json(arch_file.read())
model1.load_weights('model1_weights.h5')
# K.get_session() is session1
# do the same for graph2, session2, model2
You need to create 2 sessions and restore the 2 models separately. In order for this to work you need to do the following:
1a. When you're saving the models you need to add scopes to the variable names. That way you will know which variables belong to which model:
# The first model
tf.Variable(tf.zeros([self.batch_size]), name="model_1/Weights")
...
# The second model
tf.Variable(tf.zeros([self.batch_size]), name="model_2/Weights")
...
1b. Alternatively, if you already saved the models you can rename the variables by adding scope with this script.
2.. When you restore the different models you need to filter by variable name like this:
# The first model
sess_1 = tf.Session()
sess_1.run(tf.initialize_all_variables())
saver_1 = tf.train.Saver([v for v in tf.all_variables() if 'model_1' in v.name])
saver_1.restore(sess_1, weights_1_file)
sess_1.run(pred, feed_dict={image: X})
# The second model
sess_2 = tf.Session()
sess_2.run(tf.initialize_all_variables())
saver_2 = tf.train.Saver([v for v in tf.all_variables() if 'model_2' in v.name])
saver_2.restore(sess_2, weights_2_file)
sess_2.run(pred, feed_dict={image: X})
I encountered the same problem and could not solve the problem (without retraining) with any solution i found on the internet. So what I did is load each model in two separate threads which communicate with the main thread. It is simple enough to write the code, you just have to be careful when you synchronize the threads.
In my case each thread received the input for its problem and returned to the main thread the output. It works without any observable overhead.
One way is to clear your session if you want to train or load multiple models in succession. You can easily do this using
from keras import backend as K
# load and use model 1
K.clear_session()
# load and use model 2
K.clear_session()`
K.clear_session() destroys the current TF graph and creates a new one.
Useful to avoid clutter from old models / layers.