Data pipeline in tf.keras with tfrecords or numpy - tensorflow

I want to train a model in tf.keras of TensorFlow 2.0 with data that is larger than my RAM, but the tutorials only show examples with predefined datasets.
I followed this tutorial:
Load Images with tf.data, but I could not make this work for data in numpy arrays or tfrecords.
This is an example of an array being transformed into a TensorFlow dataset. What I want is to make this work for multiple numpy array files or multiple tfrecords files.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# Shuffle and slice the dataset.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Since the dataset already takes care of batching,
# we don't pass a `batch_size` argument.
model.fit(train_dataset, epochs=3)

If you have tfrecords files:
path = ['file1.tfrecords', 'file2.tfrecords', ..., 'fileN.tfrecords']
dataset = tf.data.Dataset.list_files(path, shuffle=True).repeat()
dataset = dataset.interleave(lambda filename: tf.data.TFRecordDataset(filename), cycle_length=len(path))
dataset = dataset.map(parse_function).batch(batch_size)
parse_function handles decoding and any kind of augmentation.
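For example, a minimal parse_function might look like the following. This is a sketch that assumes each serialized example holds a JPEG-encoded 'image' feature and an int64 'label' feature; adapt the keys, decoder, and target size to how your tfrecords were written.
def parse_function(serialized_example):
    features = tf.io.parse_single_example(serialized_example, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features['image'], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    # any augmentation (flips, crops, etc.) would go here
    return image, features['label']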
In the case of numpy arrays, you can construct the dataset either from a list of filenames or from a list of arrays. Labels are just a list, or they could be extracted from the file while parsing a single example.
path = #list of numpy arrays
or
path = os.listdir(path_to_files)
dataset = tf.data.Dataset.from_tensor_slices((path, labels))
dataset = dataset.map(parse_function).batch(batch_size)
parse_function handles decoding:
def parse_function(filename, label):  # both filename and label are passed if you provided both to from_tensor_slices
    f = tf.io.read_file(filename)
    image = tf.image.decode_image(f)
    image = tf.reshape(image, [H, W, C])
    label = label  # or it could be extracted from, for example, the filename, or from the file itself
    # do any augmentations here
    return image, label
To use .npy files, it's simplest to skip read_file and decode_raw entirely: load the arrays with np.load first and build the dataset from the arrays themselves:
arrays = [np.load(f) for f in ["x1.npy", "x2.npy"]]
dataset = tf.data.Dataset.from_tensor_slices(arrays)
Alternatively you can read the raw file bytes and decode them with decode_raw (note that a .npy file starts with a header, so the decoded buffer contains more than just the array values):
f = tf.io.read_file(filename)
image = tf.io.decode_raw(f, tf.float32)
Then just pass the batched dataset to model.fit(dataset). TensorFlow 2.0 allows simple iteration over a dataset; there is no need to use an iterator. Even in later versions of the 1.x API you could just pass the dataset to the .fit method:
for example in dataset:
func(example)

Related

How to apply the same preprocessing steps to a numpy array that keras applied to images during training?

Problem Setup
1. I have a collection of float32 numpy arrays (say: (100, 100) each) that belong to several classes.
2. I've created an image dataset from them by saving them to disk (DATA_SET_PATH) using matplotlib.image.imsave(<save_path.jpg>, array, cmap='gray').
3. Then, I've trained a pretrained VGG model on that image dataset using the following.
from tensorflow.keras.applications.vgg19 import preprocess_input
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = augmenter.flow_from_directory(
    <DATA_SET_PATH>,
    target_size=(224, 224),  # this is the input size that the VGG model expects
    color_mode='rgb',
    ...  # other parameters
)
model = tf.keras.applications.VGG19(include_top=False,
                                    weights='imagenet',
                                    input_shape=(224, 224, 3))
# other model configurations ...
# ...
model.fit(train_generator, ...)
Now, in production, I receive a numpy array in the same format as in (1) above, and I want to obtain the prediction for that single numpy array using model.predict().
Question
So, in this setup, how can I ensure that a single numpy array (the input) is transformed into the same state a model input tensor had during training?
What I tried:
import numpy as np
import cv2
input = np.random.randn(100, 100).astype(np.float32)  # sample input array
# first resize the array
input = cv2.resize(input, (224, 224), interpolation=cv2.INTER_AREA)
# make this have three channels (because the VGG model expected that during training)
input = np.stack([input] * 3, axis=-1)
# pass through the respective preprocessing function
input = preprocess_input(input)
When I pass this to model.predict() after expanding the dimensions, the predictions are obviously wrong, despite good performance during training.
I think this is because the above input is different from what model.input received during training. If needed, I can save the input array to an image as in (2) above, but I want to know the next steps that keras would apply to it.
Edit:
Based on the insight by @Lescurel in a comment and looking at the source of tf.keras.preprocessing.image, I've used the load_img() function and got this working by saving the array to an image and then loading it (to reproduce step 2 above and to make sure the preprocessing_function gets values in the range 0-255).
Here's how I got it to work:
input = np.random.randn(100, 100).astype(np.float32)  # sample input array
# save `input` to an image and load it back
temp_path = "temp_path.jpg"
matplotlib.image.imsave(temp_path, input, cmap='gray')
img = tf.keras.preprocessing.image.load_img(temp_path,
                                            color_mode='rgb',
                                            target_size=(224, 224))
# convert to an array
input = tf.keras.preprocessing.image.img_to_array(img)
input = preprocess_input(input)
# the above `input` is passed to the model after adding the extra dimension.
# ...
For my use case, I would still prefer to avoid saving to an image and instead transform the numpy array directly into what preprocessing_function expects by ensuring its values are in the (0, 255) range, but that will be the scope of another question :)
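For reference, one possible direction is sketched below. This is an unverified sketch: the min-max scaling only roughly mimics the per-image normalization that matplotlib.image.imsave with cmap='gray' applies before writing the file.
arr = np.random.randn(100, 100).astype(np.float32)  # sample input array
# min-max scale into [0, 255] to approximate the value range of the saved image
arr = (arr - arr.min()) / (arr.max() - arr.min()) * 255.0
arr = cv2.resize(arr, (224, 224), interpolation=cv2.INTER_AREA)
arr = np.stack([arr] * 3, axis=-1)  # grayscale -> 3 identical channels, as the model expects
arr = preprocess_input(arr)
pred = model.predict(np.expand_dims(arr, axis=0))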

How to feed tfrecords to the networks in tensorflow

I want to feed my custom data to a deep network using "tfrecords" in tensorflow. There are several ways to do this: using a coordinator or an iterator. I got mixed up and tried several times following books' and blogs' guides, but unfortunately none of them worked for me.
Roughly speaking, assume that I have a tfrecords file and a model.
How can we complete the following code to feed the training process with tfrecords?
with tf.Session() as sess:
    # getting the images and labels
    ...
    feed_dict = {x: images, y: labels}
    loss = sess.run([optimization], feed_dict=feed_dict)
I looked at similar questions online, but those didn't answer what I am looking for.
Here is the official link:
https://www.tensorflow.org/api_guides/python/reading_data
You read TFRecord files with a tensorflow Dataset object or a TFRecordReader. I think this article explains it well (both writing and reading TFRecord files), using TFRecordReader (I prefer Dataset, myself).
Using dataset, your pipeline is something like this:
import tensorflow as tf
filenames = [...]
def _parse_fn(x):
    feature_set = {'image': tf.FixedLenFeature([], tf.string),
                   'label': tf.FixedLenFeature([], tf.int64)}
    parsed = tf.parse_single_example(x, features=feature_set)
    return parsed['image'], parsed['label']
dataset = (
    tf.data
    .TFRecordDataset(filenames)
    .map(_parse_fn)
    .shuffle(...)
    .batch(...)
    # etc.
)
iterator = dataset.make_one_shot_iterator()
next_item = iterator.get_next()  # This is a nested structure of Tensors
image, label = next_item  # Since our _parse_fn returned a tuple of 2
...  # now you build your model here, from the `image` Tensor
     # rather than from a placeholder as you do when using feed_dict
with tf.Session() as sess:
    print(sess.run([optimize]))
The main thing to notice is that the output you get from the Dataset or TFRecordReader is a Tensor, not a numpy array. Therefore you don't use feed_dict; instead you build your model from the input tensor (that is, your model will extend directly from the `image` Tensor, rather than from a placeholder).
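To make that concrete, here is a minimal TF1-era sketch of building on the tensor directly. It assumes the 'image' feature was stored as the raw uint8 bytes of fixed-size 28x28 grayscale images; the layer sizes are purely illustrative.
image = tf.decode_raw(image, tf.uint8)  # byte strings -> uint8 values
image = tf.reshape(tf.cast(image, tf.float32), [-1, 28, 28, 1])
logits = tf.layers.dense(tf.layers.flatten(image), 10)  # the model starts from `image`, not a placeholder
loss = tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits)
optimize = tf.train.AdamOptimizer().minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([optimize, loss]))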

how to feed data in batches TensorFlow CNN?

Almost all examples on github and other blogs use the MNIST dataset for demos. When I try to use the same deep NN for my image data I encounter the following problem.
They use:
batch_x, batch_y = mnist.train.next_batch(batch_size)
# Run optimization op (backprop)
sess.run(train_op, feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.8})
That is, the next_batch method feeds the data in batches.
My question is:
Do we have any similar method to feed data in batches?
You should have a look at tf.contrib.data.Dataset. You can create an input pipeline: define the source, apply a transformation, and batch it. See the programmer's guide for importing data.
From the documentation:
The Dataset API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training
EDIT:
I guess what you have is an array of pictures (filenames). Here is an example from the programmer's guide.
Depending on your input files, the transformation part will change. Here is the extract for consuming an array of picture files.
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string)
    image_resized = tf.image.resize_images(image_decoded, [28, 28])
    return image_resized, label
# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])
# labels[i] is the label for the image in filenames[i].
labels = tf.constant([0, 37, ...])
dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
# Now you have a dataset of (image, label) pairs. Basically a kind of list with
# all your pictures encoded along with a label.
# Batch it.
dataset = dataset.batch(32)
# Create an iterator.
iterator = dataset.make_one_shot_iterator()
# Retrieve the next element.
image_batch, label_batch = iterator.get_next()
You could also shuffle your images.
Now you can use image_batch and label_batch in your model definition, in place of placeholders.
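For instance, a minimal sketch (build_model is a hypothetical function mapping an image batch to logits): shuffle before batching, then drive training from the batch tensors directly, with no feed_dict.
dataset = dataset.shuffle(buffer_size=1000)  # insert this before dataset.batch(32)
# ... batch and create the iterator as above, then:
logits = build_model(image_batch)
loss = tf.losses.sparse_softmax_cross_entropy(labels=label_batch, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        _, loss_val = sess.run([train_op, loss])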

Loading folders of images in tensorflow

I'm new to tensorflow, but I have already followed and executed the tutorials they promote and many others all over the web.
I made a little convolutional neural network over the MNIST images. Nothing special, but I would like to test it on my own images.
Now my problem comes: I created several folders; the name of each folder is the class (label) the images inside belong to.
The images have different shapes; I mean they have no fixed size.
How can I load them for use with Tensorflow?
I followed many tutorials and answers both here on StackOverflow and on other Q&A sites. But still, I did not figure out how to do this.
The tf.data API (tensorflow 1.4 onwards) is great for things like this. The pipeline will look something like the following (a skeleton assembling these steps is sketched after the list):
Create an initial tf.data.Dataset object that iterates over all examples;
(if training) shuffle/repeat the dataset;
map it through some function that makes all images the same size;
batch;
(optionally) prefetch, to tell your program to preprocess subsequent batches of data while the network is processing the current batch; and
get inputs.
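A skeleton of those steps (the names image_paths/labels and the resize_fn mapping function are illustrative assumptions):
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))  # 1. source of all examples
dataset = dataset.shuffle(buffer_size=len(image_paths)).repeat()     # 2. shuffle/repeat for training
dataset = dataset.map(resize_fn)                                     # 3. map every image to the same size
dataset = dataset.batch(32)                                          # 4. batch
dataset = dataset.prefetch(1)                                        # 5. overlap preprocessing with training
images, labels = dataset.make_one_shot_iterator().get_next()         # 6. get inputs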
There are a number of ways of creating your initial dataset (see here for a more in-depth answer).
TFRecords with Tensorflow Datasets
From tensorflow version 1.12 onwards, Tensorflow datasets provides a relatively straightforward API for creating tfrecord datasets, and it also handles data downloading, sharding, statistics generation and other functionality automatically.
See e.g. this image classification dataset implementation. There's a lot of bookkeeping stuff in there (download urls, citations etc), but the technical part boils down to specifying features and writing a _generate_examples function:
features = tfds.features.FeaturesDict({
    "image": tfds.features.Image(shape=(_TILES_SIZE,) * 2 + (3,)),
    "label": tfds.features.ClassLabel(names=_CLASS_NAMES),
    "filename": tfds.features.Text(),
})
...
def _generate_examples(self, root_dir):
    root_dir = os.path.join(root_dir, _TILES_SUBDIR)
    for i, class_name in enumerate(_CLASS_NAMES):
        class_dir = os.path.join(root_dir, _class_subdir(i, class_name))
        fns = tf.io.gfile.listdir(class_dir)
        for fn in sorted(fns):
            image = _load_tif(os.path.join(class_dir, fn))
            yield {
                "image": image,
                "label": class_name,
                "filename": fn,
            }
You can also generate the tfrecords using lower level operations.
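If you go the lower-level route, a sketch of writing such records might look like this. It assumes you already have lists image_paths and integer labels; the feature keys are your choice, as long as the reading side uses the same ones.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter("images.tfrecords") as writer:
    for path, label in zip(image_paths, labels):
        with tf.io.gfile.GFile(path, "rb") as f:
            encoded = f.read()  # the already-encoded image bytes (e.g. jpeg/png)
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": _bytes_feature(encoded),
            "label": _int64_feature(label),
        }))
        writer.write(example.SerializeToString())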
Load images via tf.data.Dataset.map and tf.py_func(tion)
Alternatively you can load the image files from filenames inside tf.data.Dataset.map as below.
image_paths, labels = load_base_data(...)
epoch_size = len(image_paths)
image_paths = tf.convert_to_tensor(image_paths, dtype=tf.string)
labels = tf.convert_to_tensor(labels)
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
if mode == 'train':
    dataset = dataset.repeat().shuffle(epoch_size)

def map_fn(path, label):
    # path/label represent values for a single example
    image = tf.image.decode_jpeg(tf.read_file(path))
    # some mapping to constant size - be careful with distorting aspect ratios
    image = tf.image.resize_images(image, out_shape)
    # color normalization - just an example
    image = tf.to_float(image) * (2. / 255) - 1
    return image, label

# num_parallel_calls > 1 induces intra-batch shuffling
dataset = dataset.map(map_fn, num_parallel_calls=8)
dataset = dataset.batch(batch_size)
# try one of the following
dataset = dataset.prefetch(1)
# dataset = dataset.apply(
#     tf.contrib.data.prefetch_to_device('/gpu:0'))
images, labels = dataset.make_one_shot_iterator().get_next()
I've never worked in a distributed environment, but I've never noticed a performance hit from using this approach over tfrecords. If you need more custom loading functions, also check out tf.py_func.
More general information here, and notes on performance here
Sample input pipeline script to load images and labels from a directory. You could do preprocessing (resizing images, etc.) after this.
import tensorflow as tf
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once("/home/xxx/Desktop/stackoverflow/images/*/*.png"))
image_reader = tf.WholeFileReader()
key, image_file = image_reader.read(filename_queue)
S = tf.string_split([key], '/')
length = tf.cast(S.dense_shape[1], tf.int32)
# adjust the constant value corresponding to your paths if you face issues. It should work for the above format.
label = S.values[length - tf.constant(2, dtype=tf.int32)]
label = tf.string_to_number(label, out_type=tf.int32)
image = tf.image.decode_png(image_file)
# Start a new session to show example output.
with tf.Session() as sess:
    # Required to get the filename matching to run.
    tf.initialize_all_variables().run()
    # Coordinate the loading of image files.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(6):
        # Get an image tensor and print its value.
        key_val, label_val, image_tensor = sess.run([key, label, image])
        print(image_tensor.shape)
        print(key_val)
        print(label_val)
    # Finish off the filename queue coordinator.
    coord.request_stop()
    coord.join(threads)
File Directory
./images/1/1.png
./images/1/2.png
./images/3/1.png
./images/3/2.png
./images/2/1.png
./images/2/2.png
Output:
(881, 2079, 3)
/home/xxxx/Desktop/stackoverflow/images/3/1.png
3
(155, 2552, 3)
/home/xxxx/Desktop/stackoverflow/images/2/1.png
2
(562, 1978, 3)
/home/xxxx/Desktop/stackoverflow/images/3/2.png
3
(291, 2558, 3)
/home/xxxx/Desktop/stackoverflow/images/1/1.png
1
(157, 2554, 3)
/home/xxxx/Desktop/stackoverflow/images/1/2.png
1
(866, 936, 3)
/home/xxxx/Desktop/stackoverflow/images/2/2.png
2
For loading images of equal size just use this:
tf.keras.preprocessing.image_dataset_from_directory(dir)
docs: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image_dataset_from_directory
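A short usage sketch (the path and parameter values are illustrative; image_size resizes every image to a common shape, and the directory is expected to contain one subfolder per class):
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "images/train",          # assumed layout: images/train/<class>/*.jpg
    image_size=(224, 224),
    batch_size=32,
    label_mode='int')
model.fit(train_ds, epochs=3)  # assumes a compiled `model` is already defined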
To load images with different shapes, tf provides a pipeline implementation (ImageDataGenerator):
from tensorflow.keras.preprocessing.image import ImageDataGenerator
TARGET_SHAPE = (500, 500)
BATCH_SIZE = 32
train_dir = "train_images_directory"  # ex: images/train/
test_dir = "test_images_directory"    # ex: images/test/
train_images_generator = ImageDataGenerator(rescale=1.0/255,)
train_data_gen = train_images_generator.flow_from_directory(batch_size=BATCH_SIZE,
                                                            directory=train_dir,
                                                            target_size=TARGET_SHAPE,
                                                            shuffle=True,
                                                            class_mode='sparse')
# do the same for validation and test dataset
# 1- image_generator 2- load images from directory with target shape

Tensorflow slim how to specify batch size during training

I'm trying to use the slim interface to create and train a convolutional neural network, but I couldn't figure out how to specify the batch size for training.
During training my net crashes with "Out of Memory" on my graphics card.
So I think there should be a way to handle this condition...
Do I have to split the data and the labels into batches and then explicitly loop, or does slim.learning.train take care of it?
In the code I paste, train_data is all the data in my training set (numpy array)... and the model definition is not included here.
I had a quick look at the sources but no luck so far...
g = tf.Graph()
with g.as_default():
    # Set up the data loading:
    images = train_data
    labels = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)
    # Define the model:
    predictions = model7_2(images, num_classes, is_training=True)
    # Specify the loss function:
    slim.losses.softmax_cross_entropy(predictions, labels)
    total_loss = slim.losses.get_total_loss()
    tf.scalar_summary('losses/total loss', total_loss)
    # Specify the optimization scheme:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
    train_tensor = slim.learning.create_train_op(total_loss, optimizer)
    slim.learning.train(train_tensor,
                        train_log_dir,
                        number_of_steps=1000,
                        save_summaries_secs=300,
                        save_interval_secs=600)
Any hints or suggestions?
Edit:
I re-read the documentation... and I found this example:
image, label = MyPascalVocDataLoader(...)
images, labels = tf.train.batch([image, label], batch_size=32)
But it's not clear at all how to feed image and label to tf.train.batch, as the MyPascalVocDataLoader function is not specified...
In my case my data set is loaded from a sqlite database, and I have training data and labels as numpy arrays... still confused.
Of course I tried to pass my numpy arrays (converted to constant tensors) to tf.train.batch, like this:
image = tf.constant(train_data)
label = tf.contrib.layers.one_hot_encoding(labels=train_labels, num_classes=num_classes)
images, labels = tf.train.batch([image, label], batch_size=32)
But it seems not the right path to follow... it seems that train.batch wants only one element from my data set... (how do I pass this? it does not make sense to me to pass only train_data[0] and train_labels[0])
Here you can create tfrecords, which is the special binary file format used by tensorflow. As you mentioned you have the training images and the labels, you can easily create the TFrecords for training and validation.
After creating the TFrecords, all you need to do is decode the images from the encoded TFrecords and give them to your model input. There you can select the batch size and everything else.
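For instance, a minimal TF1-era sketch of that decode-and-batch step. The feature keys 'image_raw'/'label', the fixed image shape, and the queue capacities are all illustrative assumptions; the resulting images/labels tensors are what you would build the model on before calling slim.learning.train.
filename_queue = tf.train.string_input_producer(["train.tfrecords"])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features['image_raw'], tf.uint8)
image = tf.reshape(image, [64, 64, 3])  # assumed fixed shape
image = tf.cast(image, tf.float32) / 255.0
label = tf.cast(features['label'], tf.int32)
# this is where you select the batch size
images, labels = tf.train.shuffle_batch([image, label], batch_size=32,
                                        capacity=2000, min_after_dequeue=1000)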