TF API Dataset: initialization

The tf.dataset API works really well; I was able to speed up learning ~2x. But I still have a performance problem: GPU utilization is low (despite using tf.dataset with several workers).
My use case is following:
~400 training examples, each with 10 input channels (~5 GB in total)
The task is segmentation using ResNet50. The forward-backward pass takes ~0.15 s. Batch size = 32
The data loading is fast, taking ~0.06 s.
But after one epoch (400/32 ≈ 13 iterations), the data loading takes ~3.5 seconds, the same as the initialization of the loader (which is more than processing a whole epoch). This makes learning very slow.
My question is: is there an option to eliminate the initialization after each epoch and just feed the data continuously?
I tried setting dataset.repeat(10) but it did not help.
The loading code and train is here: https://gist.github.com/melgor/0e681a4fe8f125d25573aa30d8ace5f3
The model is just ResNet transformed into an Encoder-Decoder architecture for image segmentation. Most of the code is taken from https://github.com/argman/EAST, but since loading there is very slow, I would like to transform it to TFRecords.
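For reference, a continuously feeding tf.data pipeline usually looks roughly like this (a sketch with placeholder names such as filenames and parse_fn, not the code from the gist above); putting repeat() without an argument before batching keeps the iterator alive across epochs, and prefetch() overlaps loading with the training step:

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn, num_parallel_calls=4)  # decode/augment in parallel
dataset = dataset.shuffle(buffer_size=400)
dataset = dataset.repeat()        # repeat indefinitely, no per-epoch re-initialization
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)     # overlap the input pipeline with the training step
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()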

I partly resolved my problem with the long initialization: I just made the tfrecord file smaller.
In my base implementation I used raw strings for the images (i.e. strings created from the numpy arrays). The new tfrecord contains images compressed as JPEG or PNG. Thanks to that, the file is 50x smaller, which makes initialization much faster. There is a downside, though: your images need to be uint8 (JPEG) or uint16 (PNG). In case of float data you can use uint16, but there will be a loss of information.
For encoding a numpy array to a compressed string you can use Tensorflow itself:
encoded_jpeg = tf.image.encode_jpeg(tf.constant(img),format='rgb').eval(session=sess)
encoded_png = tf.image.encode_png(tf.constant(png_image)).eval(session=sess)
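A rough end-to-end sketch of writing one compressed image into a TFRecord and decoding it back inside the input pipeline (TF1-style; img is assumed to be a uint8 HxWx3 numpy array, and the file name and parse_fn are made up):

import tensorflow as tf

# --- writing (done once, offline) ---
with tf.Session() as sess:
    encoded_png = tf.image.encode_png(tf.constant(img)).eval(session=sess)

example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_png])),
}))
with tf.python_io.TFRecordWriter('train.tfrecord') as writer:
    writer.write(example.SerializeToString())

# --- reading (inside dataset.map) ---
def parse_fn(serialized):
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
    })
    return tf.image.decode_png(features['image/encoded'], channels=3)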

Related

How to Modify Data Augmentation in Darknet's YOLOv4

I built a digital scale reader using Darknet's YOLOv4-Tiny. It has trouble distinguishing 2's from 5's, which leads me to believe that I am doing some unwanted data augmentation during training. (The results are mostly correct, and glare could be a factor, but I am expecting better results.)
I have referenced this post:
Understanding darknet's yolo.cfg config files
and the darknet github:
https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
Below is a link to the yolov4-tiny.cfg that I modified for my model:
https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-tiny.cfg
And a snippet from the link above:
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=1
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
Am I correct that angle=0 means that there is no rotation?
Are there any other possible ways I might be augmenting my data that could cause an issue?
Edit: If I wanted, how could I eliminate all data augmentation?
Or do I just need more data (currently 2484 images for 10 digit classes)?
Horizontal flip is applied by default; add flip=0 to disable it.
https://github.com/AlexeyAB/darknet
Here, angle, saturation, exposure, and hue are all part of data augmentation. You can effectively eliminate that augmentation by setting them to neutral values (angle=0 and hue=0; saturation and exposure are multiplicative ranges, so 1.0 leaves the images unchanged). Since these augmentation values are hyperparameters, changing them can make accuracy better or worse, so I suggest keeping the values shipped with darknet, as they were found to give good accuracy. Darknet generates the augmented training images from these four settings. As for more data: adding more (non-duplicated) images is always good, as it helps the model learn the necessary complexity during training.
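For example (a sketch, not a tested config), a [net] section with augmentation effectively disabled could look like the following; flip and mosaic are additional keys documented in the AlexeyAB wiki, and saturation/exposure are set to 1.0 rather than 0 because they are multiplicative ranges:
[net]
batch=64
subdivisions=1
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
# augmentation switched off
angle=0
saturation=1.0
exposure=1.0
hue=0
flip=0
mosaic=0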

Create round-robin sharding while generating sharded tfrecords

I am new to tensorflow and I am working on an image segmentation problem in tensorflow 1.14. I have a huge dataset, and generating tfrecords is very slow when I try to generate one big tfrecord file. So I would like to create 'n' shards of tfrecords, but I could not find a way to do it online. Say I have 600 images and 600 masks; I want to generate 6 shards of tfrecords, with 100 images and 100 masks each, written in round-robin fashion. A high-level pseudo-code of what I want is as follows:
sharded_tf_record_writer:
    create n TFRecordWriter
    for each item in the dataset:
        write_example to one of the n writers, in round-robin fashion
I did search online and could not find a relevant answer. I do not want to use Apache Beam for sharding. I would appreciate any idea/help/guidance to achieve this.
I asked the same question in one of the issues of tensorflow/datasets, and the user Conchylicultor responded with this:
Writing is done by _TFRecordWriter. Tfds will automatically compute the required number of shards and distribute examples across shards; however, each shard is written sequentially.
You do not have control over the number of shards; it is also computed automatically.
However, the fact that examples are distributed between shards does not make the writing faster, as examples are not pre-processed in parallel. If you want parallelism, then you'll have to use Apache Beam, which allows scaling even to huge datasets.
The link to the tensorflow/datasets issue is - https://github.com/tensorflow/datasets/issues/676
This might help.
Since you are working with object detection in tensorflow, there is some nice code in the official Tensorflow models repository that will do what you want. Note this code is for Tensorflow 2 (not sure if it'll work in TF1).
See this example of writing sharded tfrecords from COCO annotations. The idea is that you open a list of TFRecordWriter in an exit stack (using contextlib2.ExitStack()), which will automatically close the TFRecords when each thread finishes writing to them.
The utility function open_sharded_output_tfrecords creates this list of TFRecordWriter:
import contextlib2
import tensorflow as tf
# open_sharded_output_tfrecords lives in the TF models repository
from object_detection.dataset_tools import tf_record_creation_util

with contextlib2.ExitStack() as tf_record_close_stack, \
        tf.gfile.GFile(annotations_file, 'r') as fid:
    # Creates num_shards TFRecordWriters; they are closed automatically
    # when the exit stack unwinds.
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards
    )
Next you can use the ProcessPoolExecutor to write tfrecords into each shard in a round-robin fashion in parallel (4 workers in this example)
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(4) as executor:
    futures = []
    for idx, image in enumerate(images):
        # Each task writes one example into shard idx % num_shards.
        future = executor.submit(
            _write_tf_record,
            image,
            idx,
            num_shards,
            output_tfrecords,
        )
        futures.append(future)
    # Wait for all writes to finish (and surface any exceptions).
    for future in futures:
        future.result()
where _write_tf_record may look something like this:
def _write_tf_record(image, idx, num_shards, output_tfrecords):
    tf_example = create_tf_example(image)
    shard_idx = idx % num_shards  # round-robin shard assignment
    output_tfrecords[shard_idx].write(tf_example.SerializeToString())
Just make sure you have more shards than multiprocess workers, otherwise the same writer may be accessed by two different processes.
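If you don't need the object_detection utilities, a minimal single-process version of the same round-robin idea can be sketched like this (TF 1.14; images, masks, and create_tf_example stand in for your own data and serialization code):

import tensorflow as tf

num_shards = 6
output_path = 'train.tfrecord'   # placeholder base name

writers = [
    tf.io.TFRecordWriter('%s-%05d-of-%05d' % (output_path, i, num_shards))
    for i in range(num_shards)
]

for idx, (image, mask) in enumerate(zip(images, masks)):
    tf_example = create_tf_example(image, mask)   # your own Example builder
    writers[idx % num_shards].write(tf_example.SerializeToString())  # round-robin

for writer in writers:
    writer.close()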

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (and I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.conv2d function will need the following amount of memory:
image_width * image_height * number_of_channels * batchSize * filter_height * filter_width * filter_depth * 32 bit
since we use tf.float32 for each pixel.
In the given example, the needed memory will be:
128x128x3x20x4x4x32x32 = 16106127360 (bits), which is almost 16 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
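As a quick sanity check, the same arithmetic in a few lines of Python:

# rough memory estimate for this single conv layer, float32 = 4 bytes
input_tensor  = 20 * 128 * 128 * 3    # NHWC input
kernel        = 4 * 4 * 3 * 32        # HWIO convolution kernel
output_tensor = 20 * 128 * 128 * 32   # NHWC output
total_mb = (input_tensor + kernel + output_tensor) * 4 / 1024 ** 2
print(round(total_mb, 1))             # ~43.8 MB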
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.

Caching a dataset with examples of varied length

My dataset is comprised of audio segments between 5 and 180 seconds long. The number of examples is small enough to allow caching them in memory instead of reading from disk over and over. Storing the data in a constant tensor / variable and using tf.train.slice_input_producer would allow me to cache the dataset in memory, but it requires storing all the data in one matrix. Since some examples are much longer than others, this matrix might be unnecessarily large and perhaps too large for the RAM.
I could simply keep a list of numpy arrays for my data and do the whole input reading-randomizing-preprocessing in a non-tensorflow way with a feed_dict, but I wonder if there is a way to do it without completely giving up on tensorflow for the input reading-randomizing-preprocessing part.
Thanks!
The more recent tf.data library provides a tf.data.Dataset.cache method to cache an entire dataset into memory or into a file.
For instance:
dataset = ...
dataset = dataset.map(preprocessing_fn) # apply preprocessing
dataset = dataset.cache() # cache entire dataset in memory after preprocessing
I've provided more details on how to use cache() in this answer.
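For variable-length audio specifically, the whole pattern might be sketched like this (assuming a list of file paths audio_file_paths and a load_and_preprocess function that returns a 1-D float tensor per clip; both names are placeholders):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(audio_file_paths)
dataset = dataset.map(load_and_preprocess)        # decode + preprocess each clip
dataset = dataset.cache()                         # keep the decoded clips in memory
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.padded_batch(32, padded_shapes=[None])  # pad to the longest clip per batch
dataset = dataset.prefetch(1)

Unlike a single constant tensor, cache() stores each element separately, so short and long clips do not have to be padded to a common length up front.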

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
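That per-batch lookup was roughly the following (a sketch; feat_placeholder, train_op, and batch_filenames are hypothetical names standing in for the real model and batching code):

import numpy as np
import pandas as pd
import tensorflow as tf

# filename -> 4096-dim VGG feature, all kept in host RAM (~15GB)
img_feat = pd.read_pickle("img_feats.pkl")

feat_placeholder = tf.placeholder(tf.float32, shape=[None, 4096])
# ... the rest of the model is built on top of feat_placeholder ...

# at training time, look the vectors up per batch and feed them in
batch_feats = np.stack([img_feat[name] for name in batch_filenames])
sess.run(train_op, feed_dict={feat_placeholder: batch_feats})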
My first attempt to do this in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i" % i))

first_image = tf.concat(1, [tf.slice(t, [0, 0], [1, 4096 // NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded data into the Tensorflow graph would be appreciated.
Potentially related is this question; however, in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.