Too Much Memory Issue with Semantic Image Segmentation NN (DeepLabV3+) - tensorflow

First, my task: I have nearly 3000 images of two different ropes. Each image contains rope 1, rope 2 and the background. My labels/masks are images in which, for example, the pixel value 0 represents the background, 1 represents the first rope and 2 represents the second rope. You can see both the input picture and the ground truth/labels in pictures 1 and 2 below. Note that my ground truth/label has only 3 values: 0, 1 and 2.
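As a side note on the label format: a loss such as categorical_crossentropy usually expects one channel per class rather than a single channel holding the values 0, 1 and 2. The following is only a minimal NumPy sketch of that conversion; the helper name one_hot_mask and the tiny example mask are made up for illustration and are not part of the original code.

```python
import numpy as np

def one_hot_mask(mask, num_classes=3):
    """Turn an (H, W) integer mask into an (H, W, num_classes) one-hot array."""
    mask = np.asarray(mask)
    # Indexing an identity matrix with the class ids picks the matching row
    # (a one-hot vector) for every pixel.
    return np.eye(num_classes, dtype=np.float32)[mask]

# Tiny hypothetical 2x3 mask: 0 = background, 1 = rope 1, 2 = rope 2
mask = np.array([[0, 1, 2],
                 [2, 0, 1]])
encoded = one_hot_mask(mask)
print(encoded.shape)  # (2, 3, 3)
```

Taking the argmax over the last axis of the encoded array gives back the original integer mask.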
My input picture is gray, but for DeepLab I converted it to an RGB picture, because DeepLab was trained on RGB pictures. My converted picture still doesn't contain any color, though.
The idea of this task is that the neural network should learn the structure of ropes, so that it can label ropes correctly even if there are knots. Color information is therefore not important. My ropes do have different colors, which is why it was easy to use KMeans to create the ground truth/labels.
For this task I chose a semantic segmentation network called DeepLab V3+ in Keras with TensorFlow as backend. I want to train the NN with my nearly 3000 images. All the images together are under 100MB and each is 300x200 pixels.
Maybe DeepLab is not the best choice for my task, because my pictures don't contain color information and are very small (300x200), but I haven't found a better semantic segmentation NN for my task so far.
From the Keras website I know how to load the data with flow_from_directory and how to use the fit_generator method. I don't know whether my code is logically correct...
My first question is:
With my implementation, my graphics card used nearly all of its memory (11GB). I don't know why. Is it possible that the weights from DeepLab are that big? My batch size is the default 32, and all my nearly 3000 images together are under 100MB. I already used the config.gpu_options.allow_growth = True code, see my code below.
A general question:
Does somebody know a good semantic segmentation NN for my task? I don't need an NN that was trained with color images. But I also don't need an NN that was trained with binary ground truth pictures...
I tested my raw color image (picture 3) with DeepLab, but the resulting label I got was not good...
Here is my code so far:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
import numpy as np
from model import Deeplabv3
import tensorflow as tf
import time
import tensorboard
import keras
from keras.preprocessing.image import img_to_array
from keras.applications import imagenet_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import TensorBoard
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
from keras import backend as K
NAME = "DeepLab-{}".format(int(time.time()))
deeplab_model = Deeplabv3(input_shape=(300,200,3), classes=3)
tensorboard = TensorBoard(log_dir="logpath/{}".format(NAME))
deeplab_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
# we create two instances with the same arguments
data_gen_args = dict(featurewise_center=True,
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)
# Provide the same seed and keyword arguments to the fit and flow methods
seed = 1
image_datagen.fit(..., augment=True, seed=seed)
mask_datagen.fit(..., augment=True, seed=seed)
image_generator = image_datagen.flow_from_directory(
mask_generator = mask_datagen.flow_from_directory(
# combine generators into one which yields image and masks
train_generator = zip(image_generator, mask_generator)
print("compiled")
#deeplab_model.fit(..., y, batch_size=32, epochs=10, validation_split=0.3, callbacks=[tensorboard])
deeplab_model.fit_generator(train_generator, steps_per_epoch= np.uint32(2935 / 32), epochs=10, callbacks=[tensorboard])
print("finish fit")
Here is my code to test DeepLab (from Github):
from matplotlib import pyplot as plt
import cv2 # used for resize. if you dont have it, use anything else
import numpy as np
from model import Deeplabv3
import tensorflow as tf
from PIL import Image, ImageEnhance
deeplab_model = Deeplabv3(input_shape=(512,512,3), classes=3)
#deeplab_model = Deeplabv3()
img = Image.open("Path/Input/0/0001.png")
imResize = img.resize((512,512), Image.ANTIALIAS)
imResize = np.array(imResize)
img2 = cv2.cvtColor(imResize, cv2.COLOR_GRAY2RGB)
w, h, _ = img2.shape
ratio = 512. / np.max([w,h])
resized = cv2.resize(img2,(int(ratio*h),int(ratio*w)))
resized = resized / 127.5 - 1.
pad_x = int(512 - resized.shape[0])
resized2 = np.pad(resized,((0,pad_x),(0,0),(0,0)),mode='constant')
res = deeplab_model.predict(np.expand_dims(resized2,0))
labels = np.argmax(res.squeeze(),-1)

First question: DeepLabV3+ is a very large model (I assume you are using the Xception backbone?!) and needing 11 GB of GPU memory is totally normal for a batch size of 32 with 200x300 pixels :) (Training DeepLabV3+, I needed approx. 11 GB using a batch size of 5 with 500x500 pixels). One note on the second sentence of your question: the GPU resources needed are influenced by many factors (model, optimizer, batch size, image crop, preprocessing etc.), but the actual size of your dataset shouldn't influence them. So it doesn't matter whether your dataset is 300MB or 300GB.
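To make the point about batch size and image size a bit more concrete, here is a rough back-of-the-envelope sketch. It only counts the bytes of a single float32 feature map; the 256 channels at full resolution are a made-up example, not a real DeepLabV3+ layer, and actual usage also includes all other activations, gradients and optimizer state.

```python
def feature_map_bytes(batch, height, width, channels, bytes_per_value=4):
    """Bytes needed to hold one float32 activation tensor of the given shape."""
    return batch * height * width * channels * bytes_per_value

# Hypothetical single layer with 256 channels at full input resolution.
small_batch = feature_map_bytes(batch=5, height=500, width=500, channels=256)
large_batch = feature_map_bytes(batch=32, height=300, width=200, channels=256)

print(small_batch / 2**20, "MiB for batch 5 at 500x500")
print(large_batch / 2**20, "MiB for batch 32 at 300x200")
```

Both settings push a similar number of pixels through the network per step, and the number of images on disk does not appear anywhere in the calculation.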
General Question: You are using a small dataset. Choosing DeeplabV3+ & Xception might not be a good fit, since the model might be too large. This might lead to overfitting. If you haven't obtained satisfying results yet you might try a smaller network. If you want to stick to the DeepLab-framework you could switch the backbone from the Xception network to MobileNetV2 (In the official tensorflow version it is already implemented). Alternatively, you could try using a standalone network like the Inception network with a FCN head...
In any case it would be essential to use a pre-trained encoder with a well-trained feature representation. If you don't find a good initialization of your desired model based on grayscale input images, just use a model pre-trained on RGB images and extend the pre-training with a grayscale dataset (basically you can convert any big RGB dataset to grayscale) and fine-tune the weights on the grayscale input before using your data.
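Converting an RGB dataset to grayscale while keeping the 3-channel input shape could look roughly like the sketch below. The function name rgb_to_gray3 and the random stand-in image are assumptions for illustration; the luma weights are the common ITU-R BT.601 ones, and other weightings would work as well.

```python
import numpy as np

def rgb_to_gray3(rgb):
    """Convert an (H, W, 3) RGB image to grayscale, then repeat it over 3 channels."""
    rgb = np.asarray(rgb, dtype=np.float32)
    # ITU-R BT.601 luma weights, a common choice for RGB -> grayscale.
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Repeat the single channel three times so RGB-pretrained weights still fit.
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

image = np.random.randint(0, 256, size=(200, 300, 3)).astype(np.uint8)  # stand-in image
gray3 = rgb_to_gray3(image)
print(gray3.shape)  # (200, 300, 3)
```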
I hope this helps! Cheers, Frank

IBM's Large Model Support (LMS) library enables training of large deep neural networks that would normally exhaust GPU memory while training. LMS manages this over-subscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.


Loaded keras model fails to continue training, dimensions mismatch

I'm using TensorFlow with Keras to train a char-RNN using Google Colab. I train my model for 10 epochs and save it using model.save(), as shown in the documentation for saving models. Immediately after, I load it again just to check. When I try to call fit() on the loaded model, I get a "Dimensions must be equal" error using the exact same training set. The training data is in a TensorFlow dataset organised in batches, as shown in the documentation for tf datasets. Here is a minimal working example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
X = np.random.randint(0,50,(10000))
seq_len = 150
batch_size = 20
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(20).batch(batch_size,drop_remainder=True)
def make_model(vocabulary_size,embedding_dimension,rnn_units,batch_size,stateful):
    model = Sequential()
    return model
vocab_size = 51
emb_dim = 20
rnn_units = 10
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)
model.fit(dataset,epochs=10)
model.save('/content/test_model')
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset,epochs=10)
The first training line, "model.fit(dataset,epochs=10)", runs fine but the last line returns the error:
ValueError: Dimensions must be equal, but are 20 and 150 for '{{node
Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax,
ArgMax_1)' with input shapes: [20], [20,150].
I want to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not an ideal option.
Any advice?
If you have saved checkpoints, then you can resume training from those checkpoints, even with a reduced dataset. Your neural network, its layers and their dimensions must stay the same.
The problem is the 'accuracy' metric. For some reason, there is some mishandling of dimensions on the predictions when the model is loaded with this metric, as I found in this thread (see last comment). Running model.compile() on the loaded model with the same metric allows training to continue. However, it shouldn't be necessary to compile the model again. Moreover, this means that the optimiser state is lost, as explained in this answer, thus, this is not very useful for resuming training.
On the other hand, using 'sparse_categorical_accuracy' from the start works just fine. I am able to load the model and continue training without having to recompile. In hindsight, this choice is more appropriate given that the outputs of my last layer are logits over the distribution of characters. Thus, this is not a binary but a multiclass classification problem. Nonetheless, I verified that both 'accuracy' and 'sparse_categorical_accuracy' returned the same values in my specific example. Thus, I believe that keras is internally converting accuracy to categorical accuracy, but something goes wrong when doing this on a model that has been just loaded which forces the need to recompile.
I also verified that if the saved model was compiled with 'accuracy', loading the model and recompiling with 'sparse_categorical_accuracy' will allow resuming training. However, as mentioned before, this would discard the state of the optimiser and I suspect that it would be no better than just making a new model and loading only the weights from the saved one.
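The shapes in the error message ([20] versus [20,150]) can be reproduced with plain NumPy. This is only a sketch of one plausible reading, assuming that 'accuracy' gets resolved to categorical accuracy (an argmax over both labels and predictions) on the loaded model; the random arrays are stand-ins with the batch size, sequence length and vocabulary size from the example above.

```python
import numpy as np

batch_size, seq_len, vocab_size = 20, 150, 51
rng = np.random.default_rng(0)

y_true = rng.integers(0, vocab_size, size=(batch_size, seq_len))  # integer character ids
y_pred = rng.normal(size=(batch_size, seq_len, vocab_size))       # logits per character

# Sparse categorical accuracy: compare the labels with the argmax of the predictions.
pred_ids = y_pred.argmax(axis=-1)             # shape (20, 150)
sparse_acc = (pred_ids == y_true).mean()

# Categorical accuracy expects one-hot labels and takes an argmax of them too.
# Applied to integer labels, that argmax removes the sequence axis instead.
wrong_true_ids = y_true.argmax(axis=-1)       # shape (20,)

print(pred_ids.shape, wrong_true_ids.shape)   # (20, 150) (20,)
```

With integer labels, only the sparse variant compares arrays of matching shape, which fits the observation that 'sparse_categorical_accuracy' works after loading.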

Tensorflow: Classifying images in batches

I have followed this TensorFlow tutorial to classify images using transfer learning approach. Using almost 16,000 manually classified images (with about 40/60 split of 1/0) added on top of the pre-trained MobileNet V2 model, my model achieved 96% accuracy on the hold out test set. I then saved the resulting model.
Next, I would like to use this trained model to classify new images. To do so, I have adapted one of the portions of the tutorial's code (in the end where it says #Retrieve a batch of images from the test set) in the way described below. The code works, however, it only processes one batch of 32 images and that's it (there are hundreds of images in the source folder). What am I missing here? Please advise.
# Import libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing import image_dataset_from_directory
import matplotlib.pyplot as plt
import numpy as np
import os
# Load saved model
model = tf.keras.models.load_model('/model')
# Re-compile model
base_learning_rate = 0.0001
# Define paths
PATH = 'Data/'
new_dir = os.path.join(PATH, 'New_images') # New_images must contain at least one class (sub-folder)
BATCH_SIZE = 32
IMG_SIZE = (640, 640)
new_dataset = image_dataset_from_directory(new_dir, shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE)
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
# Apply a sigmoid since our model returns logits
predictions = tf.nn.sigmoid(predictions)
predictions = tf.where(predictions < 0.5, 0, 1)
print('Predictions:\n', predictions.numpy())
len(new_dataset) # equals 25, i.e., there are 25 batches
Replace this code:
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
with this one:
predictions = model.predict(new_dataset,batch_size=BATCH_SIZE).flatten()
tf.data.Dataset objects can be directly passed to the method predict().
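If you prefer to keep an explicit loop, the reason only 32 images were processed is that next() is called once, so only the first batch is ever retrieved. A minimal sketch of looping over every batch is shown below; the helper predict_all, the fake batches and the fake model are stand-ins for illustration, so the sketch runs without TensorFlow.

```python
import numpy as np

def predict_all(batches, predict_fn):
    """Run predict_fn on every batch and return one flat array of 0/1 labels."""
    logits = np.concatenate([np.ravel(predict_fn(images)) for images, _ in batches])
    probabilities = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, since the model returns logits
    return np.where(probabilities < 0.5, 0, 1)

# Stand-ins: 3 batches of random "images" and a fake model returning one logit per image.
rng = np.random.default_rng(0)
fake_batches = [(rng.normal(size=(n, 8, 8, 3)), None) for n in (32, 32, 6)]
fake_predict = lambda images: images.mean(axis=(1, 2, 3))

labels = predict_all(fake_batches, fake_predict)
print(labels.shape)  # (70,)
```

With the real objects from the question, the call would presumably be predict_all(new_dataset.as_numpy_iterator(), model.predict_on_batch).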

Loss stuck from first epoch when using with float16 in 3D (Keras)

I've got a 3D segmentation task that I'd like to solve with a CNN in keras, using 64x64x64 patches. I know that float16 is more than good enough for the kind of data I've got in input. It's also more than good enough for the output, it's a segmentation, it could even be a uint8 (or a bool).
If I work with float32 my network works perfectly (I mean... as much as you can expect before optimizing it). I've tried switching to float16 for speed and... The loss gets stuck from iteration 0, the weights don't get updated etc... What am I doing wrong?
I've tried the following: changing loss function, changing optimizer, changing learning rate by many orders of magnitude.
I've been able to reproduce the issue with a minimal example: a network with 2 layers, no downsampling, that has to learn to predict segmented masks from the masks themselves. Works fine in float32 not in float16. I'm working in Colab with default settings, therefore keras v2.2.5 and tensorflow 1.15.0
EDIT: more updates The issue is most likely one of numerical precision in computing the loss function when using float16 and a 64x64x64 patch. When using 32x32x32 ones it does work.
from keras import backend as K
from keras.engine import Input, Model
from keras.layers import Conv3D, Activation
chosenDataType = 'float16' #swap between float16 and float32 to test
#set up a random, minimal 3D CNN
inputL = Input([64,64,64,1],dtype=chosenDataType)
l1 = Conv3D(32,[3,3,3],padding='same')(inputL)
l2 = Activation('relu') (l1)
l3 = Conv3D(32,[3,3,3],padding='same')(l2)
l4 = Activation('relu') (l3)
l5 = Conv3D(1,[1,1,1],padding='same')(l4)
model = Model(inputs=inputL,outputs=l5)
#create some fake black images with white spots
import numpy as np
import random
X_train = np.zeros([30,64,64,64,1],dtype=chosenDataType)
for imIdx in range(30):
    centPoin = random.randrange(20,44)
#ask the network to fit the images themselves.
Y_train = X_train.copy()
model.fit(X_train,Y_train,batch_size=30,epochs=100)
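One possible mechanism behind the 64x64x64 versus 32x32x32 observation, offered here as an assumption rather than a confirmed diagnosis: the largest finite float16 value is 65504, so any reduction that sums over, or divides by, the number of voxels in float16 can overflow for the larger patch. The NumPy sketch below only shows the number ranges involved and does not run Keras.

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)  # 65504.0
voxels_64 = 64 ** 3                         # 262144 values per 64x64x64 patch
voxels_32 = 32 ** 3                         # 32768 values per 32x32x32 patch

print(voxels_64 > fp16_max)  # True: the element count alone does not fit in float16
print(voxels_32 > fp16_max)  # False

# Casting the counts to float16 shows the overflow directly.
with np.errstate(over="ignore"):
    cast_64 = np.float64(voxels_64).astype(np.float16)
    cast_32 = np.float64(voxels_32).astype(np.float16)
print(cast_64, cast_32)
```

If this is indeed the cause, keeping the loss computation in float32 while the rest of the network uses float16 would be one thing to try.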

How to clean images to use with a MNIST trained model?

I am creating a machine learning model for classifying images of numbers. I have trained the model using TensorFlow and Keras on the built-in tf.keras.datasets.mnist dataset. The model works quite well with the test images from the MNIST dataset itself, but I would like to feed it images of my own. The images that I am feeding this model are extracted from a Captcha, so they all follow a similar pattern. I have included some examples of the images in this public Google Drive folder. When I feed these images to the model, I notice that it is not very accurate, and I have some guesses as to why.
The background of the image creates too much noise in the picture.
The number is not centered.
The image is not strictly in the color format of the MNIST training set (black background, white text).
I wanted to ask how I can remove the background and centre the number so that the noise in the image is reduced, allowing for better classification.
Here is the model I am using:
import tensorflow as tf
from tensorflow import keras
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
class Stopper(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, log={}):
        if log.get('acc') >= 0.99:
            self.model.stop_training = True
            print('\nReached 99% Accuracy. Stopping Training...')
model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)])
x_train, x_test = x_train / 255, x_test / 255
model.fit(x_train, y_train, epochs=10, callbacks=[Stopper()])
And here is my method of importing the image into tensorflow:
import numpy as np
from PIL import Image
img = Image.open("image_file_path").convert('L').resize((28, 28), Image.ANTIALIAS)
img = np.array(img)
I have also included some examples from the MNIST dataset here. I would like a script to convert my images as closely to the MNIST dataset format as possible. Also, since I would have to do this for an indefinite number of images, I would appreciate if you could provide a fully automated method for this conversion. Thank you very much.
You need to train with a dataset similar to the images you're testing. The MNIST data is hand-written numbers, which is not going to be similar to the computer generated fonts for Captcha data.
What you need to do is gather a catalog of Captcha data similar to what you're predicting on (preferably from the same source you will be inputting to the final model). It's a painstaking task to capture the data, and you'll probably need around 300-400 images for each label before you start to get something useful.
A key note: your model will only ever be as good as the training data you supplied to the model. Trying to make a good model with bad training data is an exercise in pure frustration.
To address some of your thoughts:
[the model is not very accurate because] the background of the image creates too much noise in the picture.
This is true. If the image data has noise and the neural net was not trained using any noise in the images, then it will not recognize a strong pattern when it encounters this type of distortion. One possible way to combat this is to take clean images and programmatically add noise to the image (noise similar to what you see in the real Captcha) before sending it to be trained.
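Adding noise to clean training images could be sketched as follows. The function add_noise, its default parameters and the toy "digit" are made up for illustration; the kind and amount of noise would need to be tuned to resemble the real Captcha backgrounds.

```python
import numpy as np

def add_noise(image, sigma=0.1, speckle_fraction=0.02, seed=None):
    """Add Gaussian noise and random black/white speckles to an image scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    # Flip a small fraction of pixels to pure black or white (salt-and-pepper noise).
    speckles = rng.random(image.shape) < speckle_fraction
    noisy[speckles] = rng.integers(0, 2, size=int(speckles.sum()))
    return np.clip(noisy, 0.0, 1.0)

clean = np.zeros((28, 28), dtype=np.float32)
clean[8:20, 12:16] = 1.0  # crude stand-in for a digit stroke
noisy = add_noise(clean, seed=0)
print(noisy.shape, noisy.min() >= 0.0, noisy.max() <= 1.0)
```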
[the model is not very accurate because] The number is not centered.
Also true for the same reasons. If all the training data is centered, the model will be overtuned for this property and make incorrect guesses. Follow a similar pattern to the one above if you don't have the capacity to manually capture and catalog a good sampling of data.
[the model is not very accurate because] The image is not strictly in the color format of MNIST training set (Black background white text).
You can get around this by applying a binary threshold to the data before processing, or by normalizing the color input before training. Depending on the amount of noise in the captcha, you may get better results by letting the number and noise retain some of their color information (still convert to greyscale and normalize, just don't apply the threshold).
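Thresholding and centering can be combined into one preprocessing step. The sketch below assumes a grayscale image scaled to [0, 1] with a dark digit on a lighter background; the function to_mnist_format, the threshold and the 4-pixel margin are illustrative choices (MNIST digits sit in a 20x20 box inside the 28x28 image), and the nearest-neighbour resize is only there to avoid extra dependencies.

```python
import numpy as np

def to_mnist_format(gray, threshold=0.5, size=28, margin=4):
    """Binarize a grayscale image in [0, 1] (dark digit, light background) and center it."""
    digit = (gray < threshold).astype(np.float32)  # dark pixels become white (1.0)
    rows, cols = np.nonzero(digit)
    if rows.size == 0:
        return np.zeros((size, size), dtype=np.float32)
    # Crop to the bounding box of the digit.
    crop = digit[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    # Scale the longer side to fit inside the margin.
    box = size - 2 * margin
    scale = box / max(crop.shape)
    new_h = max(1, int(round(crop.shape[0] * scale)))
    new_w = max(1, int(round(crop.shape[1] * scale)))
    # Nearest-neighbour resize using index arrays.
    row_idx = np.minimum((np.arange(new_h) / scale).astype(int), crop.shape[0] - 1)
    col_idx = np.minimum((np.arange(new_w) / scale).astype(int), crop.shape[1] - 1)
    resized = crop[np.ix_(row_idx, col_idx)]
    # Paste into the middle of a black canvas.
    canvas = np.zeros((size, size), dtype=np.float32)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# Stand-in input: white 40x60 image with an off-center dark rectangle as the "digit".
gray = np.ones((40, 60), dtype=np.float32)
gray[5:25, 40:50] = 0.0
result = to_mnist_format(gray)
print(result.shape)  # (28, 28)
```

A fixed threshold will not remove every kind of Captcha background, so treat this as a starting point for the format conversion only.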
Additionally I'd recommend using a convolution net rather than the linear network as it is better at distinguishing 2D features like edges and corners. i.e. use keras.layers.Conv2D layers before flattening with keras.layers.Flatten
See the great example found here: Trains a simple convnet on the MNIST dataset.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation=tf.nn.relu),
    tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(num_classes, activation=tf.nn.softmax),
])
I've used this setup for reading fonts in video gameplay footage, and with a test set of 10,000 images I'm achieving 99.98% accuracy, using a random sampling of half the dataset in training, and calculating accuracy using the total set.

Keras VGG16 preprocess_input modes

I'm using the Keras VGG16 model.
I've seen that there is a preprocess_input method to use in conjunction with the VGG16 model. This method appears to call the preprocess_input method in imagenet_utils.py, which (depending on the case) calls the _preprocess_numpy_input method in the same file.
The preprocess_input has a mode argument which expects "caffe", "tf", or "torch". If I'm using the model in Keras with TensorFlow backend, should I absolutely use mode="tf"?
If yes, is this because the VGG16 model loaded by Keras was trained with images which underwent the same preprocessing (i.e. changed input image's range from [0,255] to input range [-1,1])?
Also, should the input images for testing mode also undergo this preprocessing? I'm confident the answer to the last question is yes, but I would like some reassurance.
I would expect Francois Chollet to have done it correctly, but looking at the code, either he is or I am wrong about using mode="tf".
Updated info
@FalconUA directed me to the VGG at Oxford, which has a Models section with links for the 16-layer model. The information about the preprocess_input mode argument (tf scaling to the range -1 to 1, and caffe subtracting some mean values) is found by following the link in the Models 16-layer model: information page. In the Description section it says:
"In the paper, the model is denoted as the configuration D trained with scale jittering. The input images should be zero-centered by mean pixel (rather than mean image) subtraction. Namely, the following BGR values should be subtracted: [103.939, 116.779, 123.68]."
The mode here is not about the backend, but rather about which framework the model was trained on and ported from. In the keras link to VGG16, it is stated that:
These weights are ported from the ones released by VGG at Oxford
So the VGG16 and VGG19 models were trained in Caffe and ported to TensorFlow, hence mode == 'caffe' here (range from 0 to 255 and then subtract the mean [103.939, 116.779, 123.68]).
Newer networks, like MobileNet and ShuffleNet were trained on TensorFlow, so mode is 'tf' for them and the inputs are zero-centered in the range from -1 to 1.
In my experience in training VGG16 in Keras, the inputs should be from 0 to 255, subtracting the mean [103.939, 116.779, 123.68]. I've tried transfer learning (freezing the bottom and stack a classifier on top) with inputs centering from -1 to 1, and the results are much worse than 0..255 - [103.939, 116.779, 123.68].
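For reference, the two modes discussed here can be written out in a few lines of NumPy. This is a sketch of what the modes compute, not the Keras source; the function names preprocess_caffe and preprocess_tf are made up, and the mean values are the BGR ones quoted above.

```python
import numpy as np

IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_caffe(rgb):
    """'caffe' mode: RGB -> BGR, then subtract the per-channel ImageNet mean (no scaling)."""
    bgr = np.asarray(rgb, dtype=np.float32)[..., ::-1]
    return bgr - IMAGENET_BGR_MEAN

def preprocess_tf(rgb):
    """'tf' mode: scale pixels from [0, 255] to [-1, 1]."""
    return np.asarray(rgb, dtype=np.float32) / 127.5 - 1.0

white = np.full((1, 1, 3), 255, dtype=np.uint8)
print(preprocess_caffe(white))
print(preprocess_tf(white))
```

The outputs of the two modes live in very different numeric ranges, which is consistent with the much worse transfer learning results reported for the mismatched range.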
Trying to use VGG16 myself again lately, I had trouble getting decent results by just importing preprocess_input from vgg16 like this:
from keras.applications.vgg16 import VGG16, preprocess_input
Doing so, preprocess_input is set to 'caffe' mode by default, but having a closer look at the Keras VGG16 code, I noticed that the weights name is referring to tensorflow twice. I think that the preprocess mode should be 'tf'.
processed_img = preprocess_input(img, mode='tf')