How to feed data in batches to a TensorFlow CNN? - tensorflow

Almost all examples on GitHub or other blogs use the MNIST dataset for demos. When I try to use the same deep NN for my own image data I run into the following problem.
They use:
batch_x, batch_y = mnist.train.next_batch(batch_size)
# Run optimization op (backprop)
sess.run(train_op, feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.8})
i.e. the next_batch method to feed data in batches.
My question is:
Do we have any similar method to feed data in batches?

You should have a look at tf.contrib.data.Dataset. You can create an input pipeline: define the source, apply a transformation, and batch it. See the programmer's guide on importing data.
From the documentation:
The Dataset API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
EDIT:
I guess what you have is an array of picture filenames. Depending on your input files, the transformation part will change; here is the extract from the programmer's guide for consuming an array of picture files.
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    # Note: tf.image.decode_image does not set a static shape; if
    # resize_images complains, use tf.image.decode_jpeg/decode_png instead.
    image_decoded = tf.image.decode_image(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# labels[i] is the label for the image in filenames[i].
labels = tf.constant([0, 37, ...])

dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)

# Now you have a dataset of (image, label) pairs: essentially a list of
# all your pictures, each paired with its label.
# Batch it.
dataset = dataset.batch(32)

# Create an iterator.
iterator = dataset.make_one_shot_iterator()

# Retrieve the next element.
image_batch, label_batch = iterator.get_next()
You could also shuffle your images with dataset.shuffle(buffer_size=...) before batching.
Now you can use image_batch and label_batch directly in your model definition, in place of placeholders.
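For completeness, a minimal TF 1.x training-loop sketch using those batch tensors; build_model here is a hypothetical stand-in for your own network:
# Hypothetical sketch: build_model stands in for your own network definition.
logits = build_model(image_batch)
loss = tf.losses.sparse_softmax_cross_entropy(labels=label_batch, logits=logits)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        while True:
            sess.run(train_op)  # each run pulls the next batch from the iterator
    except tf.errors.OutOfRangeError:
        pass  # the one-shot iterator is exhausted after one pass over the data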

Related

Data pipeline in tf.keras with tfrecords or numpy

I want to train a model in tf.keras of TensorFlow 2.0 with data that is bigger than my RAM, but the tutorials only show examples with predefined datasets.
I followed this tutorial:
Load Images with tf.data, but I could not make it work for data in numpy arrays or tfrecords.
This is an example with an array being transformed into a tensorflow dataset. What I want is to make this work for multiple numpy array files or multiple tfrecords files.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle and slice the dataset.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Since the dataset already takes care of batching,
# we don't pass a `batch_size` argument.
model.fit(train_dataset, epochs=3)
If you have tfrecords files:
path = ['file1.tfrecords', 'file2.tfrecords', ..., 'fileN.tfrecords']
dataset = tf.data.Dataset.list_files(path, shuffle=True).repeat()
dataset = dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename), cycle_length=len(path))
dataset = dataset.map(parse_function).batch(batch_size)  # batch() needs a batch size
parse_function handles decoding and any kind of augmentation.
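For illustration, parse_function for TFRecords might look roughly like this. It is a sketch, assuming each record stores an encoded JPEG under the key 'image' and an int64 under 'label'; your keys, shapes, and feature spec will differ:
def parse_function(serialized):
    # Assumed feature layout; adjust keys and types to how the records were written.
    features = tf.io.parse_single_example(serialized, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    image = tf.image.resize(image, [224, 224])
    # do any augmentations here, e.g. tf.image.random_flip_left_right(image)
    return image, features['label']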
In the case of numpy arrays, you can construct the dataset either from a list of filenames or from a list of arrays. Labels are just a list, or they could be extracted from the file while parsing a single example.
path = ...  # list of numpy arrays
or
path = os.listdir(path_to_files)
dataset = tf.data.Dataset.from_tensor_slices((path, labels))
dataset = dataset.map(parse_function).batch(batch_size)
parse_function handles the decoding:
def parse_function(filename, label):
    # both filename and label are passed if you provided both to from_tensor_slices
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    image = tf.reshape(image, [H, W, C])
    label = label  # or it could be extracted from, for example, the filename, or from the file itself
    # do any augmentations here
    return image, label
.npy files cannot be decoded with read_file or decode_raw directly, because the .npy format includes header bytes before the array data. The simplest way is to load the arrays with np.load up front and build the dataset from them:
arrays = [np.load(p) for p in ["x1.npy", "x2.npy"]]
dataset = tf.data.Dataset.from_tensor_slices(np.stack(arrays))
or try using decode_raw, keeping the header in mind:
f = tf.io.read_file(filename)
image = tf.io.decode_raw(f, tf.float32)  # includes the .npy header bytes; slice them off before reshaping
Then just pass the batched dataset to model.fit(dataset). TensorFlow 2.0 allows simple iteration over a dataset, so there is no need for an explicit iterator; even in later versions of the 1.x API you could pass a dataset straight to the .fit method. You can also loop over it directly:
for example in dataset:
    func(example)
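Putting the numpy case together, a minimal end-to-end sketch; the file names, shapes, and the compiled model are assumptions:
import numpy as np
import tensorflow as tf

# Hypothetical: each .npy file holds one image array; labels are a plain list.
paths = ['x1.npy', 'x2.npy']
labels = [0, 1]
images = np.stack([np.load(p) for p in paths]).astype('float32')

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.shuffle(buffer_size=len(paths)).batch(2)

model.fit(dataset, epochs=3)  # `model` is your compiled tf.keras model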

How to get batch size back from a tensorflow dataset?

It is recommended to use a tensorflow dataset as the input pipeline, which can be set up as follows:
# Specify dataset
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle (buffer_size must be an integer)
dataset = dataset.shuffle(buffer_size=100000)
# Specify batch size
dataset = dataset.batch(128)
# Create an iterator
iterator = dataset.make_one_shot_iterator()
# Get next batch
next_batch = iterator.get_next()
I should be able to get the batch size back (either from the dataset itself or from an iterator created from it, i.e. from both iterator and next_batch). Someone might also want to know how many batches there are in the dataset or its iterators, how many batches have already been consumed and how many remain, or even to fetch particular elements or the entire dataset at once.
I wasn't able to find anything on the tensorflow documentation. Is this possible? If not, does anyone know if this has been requested as an issue on tensorflow GitHub?
Try this:
import tensorflow as tf
import numpy as np

features = np.array([[3.0, 0.0], [1.0, 2.0], [0.0, 0.0]], dtype="float32")
labels = np.array([[0], [0], [1]], dtype="float32")

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
batch_size = 2
dataset = dataset.batch(batch_size)
iterator = dataset.make_initializable_iterator()
batch_data = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(np.shape(sess.run(batch_data)[0])[0])
and you will see 2 printed, the size of the first batch.
In TF2 at least, the type of a dataset is statically defined and accessible via tf.data.Dataset.element_spec.
This is a somewhat complex return type because it has tuple nesting that matches your Dataset.
>>> tf.data.Dataset.from_tensor_slices([[[1]],[[2]]]).element_spec.shape
TensorShape([1, 1])
If your data is organized as (image, label) tuples, then you'd get a tuple of TensorSpecs. You can index into it if you are certain of the nesting of the return type. E.g.
>>> image = tf.data.Dataset.from_tensor_slices([[1],[2],[3],[4]]).batch(2, drop_remainder=True)
>>> label = tf.data.Dataset.from_tensor_slices([[1],[2],[3],[4]]).batch(2, drop_remainder=True)
>>> train = tf.data.Dataset.zip((image, label))
>>> train.element_spec[0].shape[0]
2
In TF2, tf.data.Datasets are iterables, so you can get a batch by simply doing:
batch = next(iter(dataset))
and then calculating the batch size is trivial, since it is the size of the first dimension:
batch_size = batch.shape[0]
(If the dataset yields (features, labels) tuples, index into the element first, e.g. batch[0].shape[0].)
So a complete example would look like:
# Specify dataset
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle (buffer_size must be an integer)
dataset = dataset.shuffle(buffer_size=100000)
# Specify batch size
dataset = dataset.batch(128)
# Calculate and print batch size
# ([0] indexes into the (features, labels) tuple of each element)
batch_size = next(iter(dataset))[0].shape[0]
print('Batch size:', batch_size) # prints 128
Or, if you need it as a function:
def calculate_batch_size(dataset):
    return next(iter(dataset)).shape[0]
Note that iterating over a dataset requires eager execution. Moreover, this solution assumes that your dataset is batched, and may get errors if this is not the case. You may also face errors if, after batching, you perform other operations on your dataset that change the shape of its elements.
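As for counting batches (also part of the question): recent TF 2.x versions provide tf.data.experimental.cardinality, which returns the number of elements (i.e. batches, after batching) when it can be determined statically, and UNKNOWN_CARDINALITY otherwise (e.g. for datasets built from generators or filters). A sketch:
dataset = tf.data.Dataset.from_tensor_slices(list(range(10))).batch(3)
n_batches = tf.data.experimental.cardinality(dataset)
print(int(n_batches))  # 4 -- the last batch holds the single remaining element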

How does the tf.train.batch_join() function in tensorflow work?

I am trying to train a neural network in tensorflow. I load the data along with its labels using the tf.train.batch_join() function. I do something like this:
image_batch, label_batch, image_batch_f = tf.train.batch_join(
    images_and_labels, batch_size=batch_size_placeholder,
    # shapes=[(args.image_size, args.image_size, 3), ()], enqueue_many=True,
    shapes=[(args.image_height, args.image_width, 3), (), (args.image_height, args.image_width, 3)],
    enqueue_many=True,
    capacity=4 * nrof_preprocess_threads * args.batch_size,
    allow_smaller_final_batch=True)
image_batch = tf.identity(image_batch, 'image_batch')
image_batch = tf.identity(image_batch, 'input')
label_batch = tf.identity(label_batch, 'label_batch')
image_batch_f = tf.identity(image_batch_f, 'flipped_images_batch')
Here I get three batches of data: a batch of images, a batch of labels, and a batch of flipped versions of the same images. I want to extract features for both the images and the flipped images. The lines below pass the batches of data through the network.
# Build the inference graph
prelogits, _ = network.inference(image_batch, args.keep_probability,
                                 phase_train=phase_train_placeholder,
                                 feature_dimension=args.embedding_size,
                                 weight_decay=args.weight_decay)
features = tf.nn.l2_normalize(prelogits, 1, 1e-10, name='embeddings')

# Getting the flipped embeddings
prelogits_f, _ = network.inference(image_batch_f, args.keep_probability,
                                   phase_train=phase_train_placeholder,
                                   feature_dimension=args.embedding_size,
                                   weight_decay=args.weight_decay, reuse=True)
features_flipped_images = tf.nn.l2_normalize(prelogits_f, 1, 1e-10, name='embeddings_f')
To get both sets of features, I run sess.run() on the features and features_flipped_images ops, something like this:
feed_dict = {phase_train_placeholder:False, batch_size_placeholder:batch_size}
emb, emb_f = sess.run([features, features_flipped_images],feed_dict=feed_dict)
My question is the following. I am guessing that when I do a session run on features, the batch_join function dispatches a batch of images of size batch_size. But when I do a session.run() on features_flipped_images, does batch_join dispatch a fresh batch of flipped images, or is it the same batch of flipped images that was generated when features was evaluated? If not, how do I do this? I want to extract features for a batch of images and for the matching batch of flipped images.
My guess is that each run of [features, features_flipped_images] will only get one and the same batch of data. Let's take an example:
imgs_batch, labels_batch = tf.train.batch([img, label], ...)
then, if you want to see what's in the batch:
imgs_data, labels_data = sess.run([imgs_batch, labels_batch])
You see, it's similar when you run sess.run([features, features_flipped_images], ...). I don't think you will get two batches; otherwise imgs_data and labels_data would not correspond to each other.
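To make the distinction concrete, a sketch using the names from the question: fetching both tensors in one sess.run triggers a single dequeue from batch_join, while two separate calls dequeue twice:
# One call: both outputs come from the SAME dequeued batch.
emb, emb_f = sess.run([features, features_flipped_images], feed_dict=feed_dict)

# Two calls: each run dequeues a FRESH batch, so these do NOT correspond.
emb = sess.run(features, feed_dict=feed_dict)                    # batch k
emb_f = sess.run(features_flipped_images, feed_dict=feed_dict)   # batch k+1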

Using multiple input pipelines in TensorFlow

I know how to use an input pipeline to read data from files:
input = ... # Read from file
loss = network(input) # build a network
train_op = ... # Using SGD or other algorithms to train the network.
But how can I switch between multiple input pipelines? Say, if I want to train a network for 1000 batches on the training set from the training pipeline, then validate it on a validation set from another pipeline, then keep training, then validate, then train, ..., and so forth.
It's easy to implement this with feed_dict. I also know how to use checkpoints to achieve this, just like in the cifar-10 example. But it's kind of cumbersome: I need to dump the model to disk then read it from disk again.
Can I just switch between two input pipelines (one for training data, one for validation data) to achieve this, reading 1000 batches from the training-data queue, then a few batches from the validation-data queue, and so forth? If it is possible, how do I do it?
Not sure if this is exactly what you are looking for, but I am doing training and validation in the same code in two separate loops. My code reads numeric and string data from .CSV files, not images. I am reading from two separate CSV files, one for training and one for validation. I'm sure you can generalize it to read from two 'sets' of files, rather than just single files, as the code is there.
Here are the code snippets in case it helps. Note that this code first reads everything as string and then converts the necessary cells into floats, just given my own requirements. If your data is purely numeric, you should just set the defaults to floats and all should be easier. Also, there are a couple of lines in there that drop Weights and Biases into a CSV file AND serialize them into the TF checkpoint file, depending on which way you'd prefer.
# First define the defaults:
rDefaults = [['a'] for row in range(TD+TS+TL)]

# This function reads line-by-line from CSV and separates cells into chunks:
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=False)
    _, csv_row = reader.read(filename_queue)
    data = tf.decode_csv(csv_row, record_defaults=rDefaults)
    dateLbl = tf.slice(data, [0], [TD])
    features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32)
    label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32)
    return dateLbl, features, label

# This function loads the above lines and spits them out as batches of N:
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size  # max of how much to load into memory
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# These are the TRAINING features, labels, and meta-data to be loaded from the train file:
dateLbl, features, labels = input_pipeline(fileNameTrain, batch_size, try_epochs)
# These are the TESTING features, labels, and meta-data to be loaded from the test file:
dateLblTest, featuresTest, labelsTest = input_pipeline(fileNameTest, batch_size, 1)  # 1 epoch here regardless of training
# Then you define the model, start the session, blah blah

# Fire up the queue:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

# This is the TRAINING loop:
try:
    while not coord.should_stop():
        dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
        _, acc, summary = sess.run(
            [train_step, accuracyTrain, merged_summary_op],
            feed_dict={x: feature_batch,
                       y_: label_batch,
                       keep_prob: dropout,
                       learning_rate: lRate})
except tf.errors.OutOfRangeError:  # (so done reading the file(s))
    # by the way, this dumps weights and biases into a CSV file, since you asked for that
    np.savetxt(fPath + fIndex + '_weights.csv', sess.run(W))
    # and this serializes weights and biases into the TF-formatted protobuf:
    # tf.train.Saver({'varW': W, 'varB': b}).save(sess, fileNameCheck)
finally:
    coord.request_stop()

# Now re-start the runners for the testing file:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)

try:
    while not coord.should_stop():
        # so now this line reads features, labels, and meta-data, but this time from the TESTING file:
        dateLbl_batch, feature_batch, label_batch = sess.run([dateLblTest, featuresTest, labelsTest])
        guessY = tf.argmax(y, 1).eval({x: feature_batch, keep_prob: 1})
        trueY = tf.argmax(label_batch, 1).eval()
        accuracy = round(tf.reduce_mean(tf.cast(tf.equal(guessY, trueY), tf.float32)).eval(), 2)
except tf.errors.OutOfRangeError:
    acCumTest /= i  # (acCumTest and i are accumulated elsewhere in the full code)
finally:
    coord.request_stop()
coord.join(threads)
This may differ from what you are trying to do in the sense that it first completes the training loop and THEN restarts the queues for the testing loop. Not sure how you'd do this if you want to go back and forth, but you can try to experiment with the two functions defined above by passing them the relevant file names (or lists) interchangeably.
Also I'm not sure if re-starting the queues after training is the best way to go, but it works for me. Would love to see a better example out there, as most TF examples use some built-in wrappers around the MNIST dataset to do the training in one go...
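If you can use the tf.data API instead of queues, a reinitializable iterator gives exactly this train/validate alternation without checkpointing. A sketch, assuming train_dataset and val_dataset are already built with matching types and shapes, and train_op/val_metrics are your own ops:
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                           train_dataset.output_shapes)
next_batch = iterator.get_next()  # feed this tensor into your network
train_init = iterator.make_initializer(train_dataset)
val_init = iterator.make_initializer(val_dataset)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(num_rounds):  # alternate as many times as you like
        sess.run(train_init)     # (re)start the training pipeline; give it .repeat() so 1000 steps never run dry
        for _ in range(1000):
            sess.run(train_op)
        sess.run(val_init)       # switch to the validation pipeline
        try:
            while True:
                sess.run(val_metrics)
        except tf.errors.OutOfRangeError:
            pass                 # validation set exhausted; back to training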

Tensorflow slim how to specify batch size during training

I'm trying to use the slim interface to create and train a convolutional neural network, but I couldn't figure out how to specify the batch size for training.
During training my net crashes with an "Out of Memory" error on my graphics card, so I think there should be a way to handle this condition...
Do I have to split the data and the labels into batches and then explicitly loop, or does slim.learning.train take care of it?
In the code I paste, train_data is all the data in my training set (a numpy array); the model definition is not included here.
I had a quick look at the sources but no luck so far...
g = tf.Graph()
with g.as_default():
    # Set up the data loading:
    images = train_data
    labels = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)

    # Define the model:
    predictions = model7_2(images, num_classes, is_training=True)

    # Specify the loss function:
    slim.losses.softmax_cross_entropy(predictions, labels)
    total_loss = slim.losses.get_total_loss()
    tf.scalar_summary('losses/total loss', total_loss)

    # Specify the optimization scheme:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
    train_tensor = slim.learning.create_train_op(total_loss, optimizer)
    slim.learning.train(train_tensor,
                        train_log_dir,
                        number_of_steps=1000,
                        save_summaries_secs=300,
                        save_interval_secs=600)
Any hints or suggestions?
Edit:
I re-read the documentation and found this example:
image, label = MyPascalVocDataLoader(...)
images, labels = tf.train.batch([image, label], batch_size=32)
But it's not clear at all how to feed image and label to tf.train.batch, as the MyPascalVocDataLoader function is not specified...
In my case my data set is loaded from a sqlite database, and I have the training data and labels as numpy arrays... still confused.
Of course I tried to pass my numpy arrays (converted to constant tensors) to tf.train.batch, like this:
image = tf.constant(train_data)
label = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)
images, labels = tf.train.batch([image, label], batch_size=32)
But that does not seem to be the right path to follow... it seems that train.batch wants only one element from my data set (how do I pass that? It does not make sense to me to pass only train_data[0] and train_labels[0]).
Here you can create tfrecords, which is the special binary file format used by tensorflow. Since you have the training images and the labels, you can easily create TFRecords for training and validation.
After creating the TFRecords, all you need to do is decode the images from the encoded TFRecords and feed them to your model input; there you can select the batch size and so on.
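A sketch of that round trip, writing the numpy arrays into a TFRecords file and decoding them back into a batched input; the feature names, dtype, and image dimensions here are assumptions:
import numpy as np
import tensorflow as tf

# --- Writing: one Example per (image, label) pair ---
writer = tf.python_io.TFRecordWriter('train.tfrecords')
for img, lbl in zip(train_data, train_labels):  # your numpy arrays
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(lbl)])),
    }))
    writer.write(example.SerializeToString())
writer.close()

# --- Reading: decode, reshape, and batch ---
def decode(serialized):
    feats = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(feats['image'], tf.uint8)  # dtype must match what tobytes() wrote
    image = tf.reshape(image, [height, width, channels])  # your image shape
    return image, feats['label']

dataset = tf.data.TFRecordDataset('train.tfrecords').map(decode).batch(32)
images, labels = dataset.make_one_shot_iterator().get_next()  # feed these to the model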