Yolo v3 Loss Function fails when reloaded during load_model - tensorflow

I've been trying to save a YOLO v3 model and then load it back from an h5 file.
When saving, I use a ModelCheckpoint callback with the parameter save_weights_only
set to False in order to save the WHOLE model.
However, when I try to restore the same model with the Keras load_model function, I initially get a "yolo_head function not found" error.
I then tried passing the function to load_model through its custom_objects argument, as in:
{"yolo_head":yolo_head}
Now the issue becomes "TypeError: list indices must be integers or slices, not list", because somehow there is an error in the loss function (yolo_loss, line 444) when it is loaded dynamically.
Apparently, the compiled code of the loss function is copied directly into the h5 file.
My question is this:
Is there a better/simpler YOLO loss function that I can use THAT DOES NOT refer to other functions or can be easily reloaded?
Thanks in advance,
EDIT 1: Additional Code Snippets,
Keras Checkpoint Callback definition:
checkpoint = ModelCheckpoint(
    os.path.join(log_dir, "checkpoint.h5"),
    monitor="val_loss",
    save_weights_only=False,
    save_best_only=True,
    period=1,
)
Checkpoint added to model training:
history = model.fit_generator(
    data_generator_wrapper(
        lines[:num_train], batch_size, input_shape, anchors, num_classes
    ),
    steps_per_epoch=max(1, num_train // batch_size),
    validation_data=data_generator_wrapper(
        lines[num_train:], batch_size, input_shape, anchors, num_classes
    ),
    validation_steps=max(1, num_val // batch_size),
    epochs=epoch1,
    initial_epoch=0,
    callbacks=[logging, checkpoint],
)
Trying to load the same file 'checkpoint.h5' after pre-training ended:
weights_path = os.path.join(log_dir, "checkpoint.h5")
model = load_model(weights_path, {"yolo_head":yolo_head, "tf":tf, "box_iou":box_iou,'<lambda>': lambda y_true, y_pred: y_pred})
Here is the error stack trace:
File "2_Training/Train_YOLO.py", line 206, in <module>
model = load_model(weights_path, {"yolo_head":yolo_head, "tf":tf,
"box_iou":box_iou,'<lambda>': lambda y_true, y_pred: y_pred})
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/saving.py", line 225, in _deserialize_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/saving.py", line 458, in model_from_config
return deserialize(config, custom_objects=custom_objects)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/layers/__init__.py", line 55, in deserialize
printable_module_name='layer')
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/utils/generic_utils.py", line 145, in deserialize_keras_object
list(custom_objects.items())))
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/network.py", line 1032, in from_config
process_node(layer, node_data)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/network.py", line 991, in process_node
layer(unpack_singleton(input_tensors), **kwargs)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/Users/nkwedi/.pyenv/versions/3.7.5/lib/python3.7/site-packages/keras/layers/core.py", line 687, in call
return self.function(inputs, **arguments)
File "/Users/nkwedi/Documents/MyProjects/Eroscope/EyeDetectionYOLO/2_Training/src/keras_yolo3/yolo3/model.py", line 444, in yolo_loss
anchors[anchor_mask[l]],
TypeError: list indices must be integers or slices, not list
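One plausible reading of the last frame, which the thread itself does not confirm: the model config inside the h5 file is stored as JSON, so an array passed to the Lambda layer as an argument can come back as a plain nested list, and indexing a list with another list raises exactly this TypeError. The sketch below reproduces the message in pure Python; the anchor values, the mask and the helper select_anchors are made up for illustration.

```python
import json

# Hypothetical anchors and mask, standing in for the ones used by yolo_loss.
anchors = [[10, 13], [16, 30], [33, 23], [30, 61]]
anchor_mask = [[2, 3], [0, 1]]

# A JSON round trip turns any array-like argument into plain nested lists.
restored = json.loads(json.dumps(anchors))


def select_anchors(anchor_list, mask):
    """Pick the anchors named by mask without relying on array fancy indexing."""
    return [anchor_list[i] for i in mask]


try:
    restored[anchor_mask[0]]  # same pattern as anchors[anchor_mask[l]]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

selected = select_anchors(restored, anchor_mask[0])
print(error_message)
print(selected)
```

If this is indeed the cause, converting anchors back to an array inside the loss, or rebuilding the model in code and restoring only the weights with load_weights, would be the things to try; neither was tested against this project.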

The solution for me was to use a cloud-based training platform like Google Colab.
Here's a link to a working Colab notebook with GPU enabled:
YOLO v3 Google Collab Tutorial

Related

Incompatible shapes while using triplet loss and pre-trained resnet

I am trying to use pre-trained resnet and fine-tune it using triplet loss. The following code I came up with is a combination of tutorials I found on the topic:
import pathlib
import tensorflow as tf
import tensorflow_addons as tfa
with tf.device('/cpu:0'):
    INPUT_SHAPE = (32, 32, 3)
    BATCH_SIZE = 16
    data_dir = pathlib.Path('/home/user/dataset/')
    base_model = tf.keras.applications.ResNet50V2(
        weights='imagenet',
        pooling='avg',
        include_top=False,
        input_shape=INPUT_SHAPE,
    )
    # following two lines are added after edit, originally it was model = base_model
    head_model = tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(base_model.output)
    model = tf.keras.Model(inputs=base_model.input, outputs=head_model)
    datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rotation_range=10,
        zoom_range=0.1,
    )
    generator = datagen.flow_from_directory(
        data_dir,
        target_size=INPUT_SHAPE[:2],
        batch_size=BATCH_SIZE,
        seed=42,
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(0.001),
        loss=tfa.losses.TripletSemiHardLoss(),
    )
    model.fit(
        generator,
        epochs=5,
    )
Unfortunately after running the code I get the following error:
Found 4857 images belonging to 83 classes.
Epoch 1/5
Traceback (most recent call last):
File "ReID/external_process.py", line 35, in <module>
model.fit(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
tmp_logs = train_function(iterator)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1843, in _filtered_call
return self._call_flat(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1923, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 545, in call
outputs = execute.execute(
File "/home/user/videolytics/venv_python/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 1328 values, but the requested shape has 16
[[{{node TripletSemiHardLoss/PartitionedCall/Reshape}}]] [Op:__inference_train_function_13749]
Function call stack:
train_function
2020-10-23 22:07:09.094736: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
The dataset directory has 83 subdirectories, one per class, and each of these subdirectories contains images of the given class. The dimension 1328 in the error output is the batch size (16) times the number of classes (83), and the dimension 16 is the batch size (both dimensions change accordingly if I change BATCH_SIZE).
To be honest, I do not really understand the error, so any solution or even any kind of insight into where the problem lies is deeply appreciated.
The problem is that TripletSemiHardLoss expects
labels y_true to be provided as 1-D integer Tensor with shape [batch_size] of multi-class integer labels
but flow_from_directory generates categorical (one-hot) labels by default; passing class_mode="sparse" to flow_from_directory should fix the problem.
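The numbers in the error message can be checked by hand. This is a small pure-Python sketch (no TensorFlow involved, class ids made up) of the two label layouts for one batch: one-hot labels hold batch_size * num_classes values, while sparse labels hold one integer per image, which is the shape the loss tries to reshape to.

```python
BATCH_SIZE = 16
NUM_CLASSES = 83

# Made-up class ids for one batch of images.
class_ids = [i % NUM_CLASSES for i in range(BATCH_SIZE)]

# Categorical (one-hot) layout: one row of NUM_CLASSES values per image.
one_hot = [
    [1.0 if class_id == k else 0.0 for k in range(NUM_CLASSES)]
    for class_id in class_ids
]
values_in_one_hot = sum(len(row) for row in one_hot)

# Sparse layout: a single integer class id per image.
sparse = [row.index(1.0) for row in one_hot]

print(values_in_one_hot, len(sparse))
```

With one-hot labels the loss receives 1328 values where it expects 16, which matches the "Input to reshape is a tensor with 1328 values, but the requested shape has 16" message.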

L2-normalization with Keras Backend?

I'd like to normalize the inputs going into my neural network, and I'm defining my model in this way:
df = pd.read_csv(r'C:\Users\Davide Mori\PycharmProjects\pythonProject\Dataset.csv')
print(df)
target_column = ['W_mag', 'W_phase']
predictors = list(set(list(df.columns)) - set(target_column))
X = df[predictors].values
Y = df[target_column].values
def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(1000,input_dim= n_inputs, activation='relu'))
    #model.add(Lambda(lambda x: K.l2_normalize(x, axis=1)))
    model.add(Dense(1000, activation='linear', activity_regularizer=regularizers.l1(0.0001)))
    model.add(Activation('relu'))
    model.add(Dense(n_outputs, activation='linear'))
    model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mean_squared_error"])
    model.summary()
    return model
n_inputs, n_outputs = X.shape[1], Y.shape[1]
model = get_model(n_inputs, n_outputs)
# fit the model on all data
model.fit(X, Y, epochs=100, batch_size=1)
How do I apply the Lambda layer to my inputs? Isn't the commented line in the wrong position? If I put the Lambda layer there, I'm normalizing what has already been "transformed" by the first hidden layer, right? How can I solve this problem?
This is the error I get when putting the Lambda layer before everything else:
2020-10-12 15:08:46.036872: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Davide Mori/PycharmProjects/pythonProject/prova_rete_sfs.py", line 60, in <module>
model = get_model(n_inputs, n_outputs)
File "C:/Users/Davide Mori/PycharmProjects/pythonProject/prova_rete_sfs.py", line 52, in get_model
model.summary()
File "C:\Users\Davide Mori\Anaconda3\envs\pythonProject\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1302, in summary
raise ValueError('This model has not yet been built. '
ValueError: This model has not yet been built. Build the model first by calling `build()` or calling `fit()` with some data, or specify an `input_shape` argument in the first layer(s) for automatic build.
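The ValueError itself points at one thing to try: when the Lambda layer comes first, it no longer inherits a shape from input_dim, so giving it an input_shape such as (n_inputs,) may let the model build (this is an assumption based on the message, not something tested against this script). Independently of Keras, the following pure-Python sketch shows what L2-normalization along axis=1 computes for each row; the helper l2_normalize_rows and the sample values are made up for illustration.

```python
import math


def l2_normalize_rows(rows, epsilon=1e-12):
    """Scale each row to unit Euclidean length, guarding against all-zero rows."""
    normalized = []
    for row in rows:
        squared_sum = sum(value * value for value in row)
        norm = math.sqrt(max(squared_sum, epsilon))
        normalized.append([value / norm for value in row])
    return normalized


# Made-up samples with two features each.
samples = [[3.0, 4.0], [0.0, 2.0]]
normalized_samples = l2_normalize_rows(samples)
print(normalized_samples)
```

Each sample is scaled on its own, so placing the layer after the first Dense layer would normalize the 1000 hidden activations rather than the raw inputs, as the question suspects.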

How to use tensorflow sequence_numeric_column with an RNNClassifier?

I was looking through the TensorFlow contrib API and I wanted to use the RNNClassifier available with TensorFlow 1.13. Unlike non-sequence estimators, this one needs sequence feature columns only. However, I was not able to make it work on a toy dataset. I keep getting an error while using sequence_numeric_column.
Here is the structure of my toy dataset:
idSeq,kind,label,size
0,0,dwarf,117.6
0,0,dwarf,134.4
0,0,dwarf,119.0
0,1,human,168.0
0,1,human,145.25
0,2,elve,153.9
0,2,elve,218.49999999999997
0,2,elve,210.9
1,0,dwarf,166.6
1,0,dwarf,168.0
1,0,dwarf,131.6
1,1,human,150.5
1,1,human,208.25
1,1,human,210.0
1,2,elve,199.5
1,2,elve,161.5
1,2,elve,197.6
where idSeq allows us to see which rows belong to which sequence.
I want to predict the "kind" column from the "size" column.
Below is the code I use to train my RNN on this dataset.
import os
import numpy as np
import pandas as pd
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.INFO)
dataframe = pd.read_csv("data_rnn.csv")
dataframe_test = pd.read_csv("data_rnn_test.csv")
train_x = dataframe
train_y = dataframe.loc[:,(["kind"])]
size_feature_col = tf.contrib.feature_column.sequence_numeric_column('size')
estimator = tf.contrib.estimator.RNNClassifier(
    sequence_feature_columns=[size_feature_col],
    num_units=[32, 16],
    cell_type='lstm',
    model_dir=None,
    n_classes=3,
    optimizer='Adagrad'
)
def make_dataset(
        batch_size,
        x,
        y=None,
        shuffle=False,
        shuffle_buffer_size=1000,
        shuffle_seed=1):
    """
    An input function for training, evaluation or prediction.
    Parameters
    ----------------------
    batch_size: integer
        the size of the batch to use for the training of the neural network
    x: pandas dataframe
        dataframe that contains the features of the samples to study
    y: pandas dataframe or array (Default: None)
        dataframe or array that contains the values to predict of the samples
        to study. If none, we want a dataset for evaluation or prediction.
    shuffle: boolean (Default: False)
        if True, we shuffle the elements of the dataset
    shuffle_buffer_size: integer (Default: 1000)
        if we shuffle the elements of the dataset, it is the size of the buffer
        used for it.
    shuffle_seed : integer
        the random seed for the shuffling
    Returns
    ---------------------
    dataset.make_one_shot_iterator().get_next(): Tensor
        a nested structure of tf.Tensors containing the next element of the
        dataset to study
    """
    def input_fn():
        if y is not None:
            dataset = tf.data.Dataset.from_tensor_slices((dict(x), y))
        else:
            dataset = tf.data.Dataset.from_tensor_slices(dict(x))
        if shuffle:
            dataset = dataset.shuffle(
                buffer_size=shuffle_buffer_size,
                seed=shuffle_seed).batch(batch_size).repeat()
        else:
            dataset = dataset.batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return input_fn
batch_size = 50
random_seed = 1
input_fn_train = make_dataset(
    batch_size=batch_size,
    x=train_x,
    y=train_y,
    shuffle=True,
    shuffle_buffer_size=len(train_x),
    shuffle_seed=random_seed)
estimator.train(input_fn=input_fn_train, steps=5000)
But I only get the following error:
INFO:tensorflow:Calling model_fn.
Traceback (most recent call last):
File "main.py", line 125, in <module>
estimator.train(input_fn=input_fn_train, steps=5000)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/contrib/estimator/python/estimator/rnn.py", line 512, in _model_fn
config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/contrib/estimator/python/estimator/rnn.py", line 332, in _rnn_model_fn
logits, sequence_length_mask = logit_fn(features=features, mode=mode)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/contrib/estimator/python/estimator/rnn.py", line 226, in rnn_logit_fn
features=features, feature_columns=sequence_feature_columns)
File "/root/.local/lib/python3.5/site-packages/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py", line 120, in sequence_input_layer
trainable=trainable)
File "/root/.local/lib/python3.5/site-packages/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py", line 496, in _get_sequence_dense_tensor
sp_tensor, default_value=self.default_value)
File "/root/.local/lib/python3.5/site-packages/tensorflow/python/ops/sparse_ops.py", line 1432, in sparse_tensor_to_dense
sp_input = _convert_to_sparse_tensor(sp_input)
File "/root/.local/lib/python3.5/site-packages/tensorflow/python/ops/sparse_ops.py", line 68, in _convert_to_sparse_tensor
raise TypeError("Input must be a SparseTensor.")
TypeError: Input must be a SparseTensor.
So I don't understand what I've done wrong, because the documentation says that we have to give sequence columns to the RNNEstimator. It does not say anything about providing a sparse tensor.
Thanks in advance for your help and advice.
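The traceback ends in sparse_tensor_to_dense, which suggests that sequence_numeric_column expects the "size" feature as a tf.SparseTensor, while from_tensor_slices(dict(x)) yields dense tensors. As a rough illustration only, the pure-Python sketch below shows the three pieces a sparse tensor is described by (indices, values, dense_shape) for variable-length sequences. The helper to_sparse_parts and the grouping of rows into sequences are assumptions, and whether feeding such a tensor fully resolves the RNNClassifier setup is not confirmed here.

```python
# Hypothetical variable-length sequences of "size" values, one list per sequence.
sequences = [
    [117.6, 134.4, 119.0],
    [168.0, 145.25],
]


def to_sparse_parts(seqs):
    """Return (indices, values, dense_shape) describing ragged sequences sparsely."""
    indices, values = [], []
    for row, seq in enumerate(seqs):
        for step, value in enumerate(seq):
            indices.append([row, step])  # position: (sequence number, time step)
            values.append(value)
    dense_shape = [len(seqs), max(len(seq) for seq in seqs)]
    return indices, values, dense_shape


indices, values, dense_shape = to_sparse_parts(sequences)
print(indices, values, dense_shape)
```

These three lists correspond to the indices, values and dense_shape arguments of tf.SparseTensor; missing time steps are simply absent rather than padded.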

Resource exhausted: OOM when allocating tensor with shape[845246,300]

I am working with a sequence-to-sequence language model, and after changing the code to pass custom word embedding weights to the Embedding layer, I am receiving an OOM error when I try to train on the GPU.
Here is the relevant code:
def create_model(word_map, X_train, Y_train, vocab_size, max_length):
    # define model
    model = Sequential()
    # get custom embedding weights as matrix
    embedding_matrix = get_weights_matrix_from_word_map(word_map)
    model.add(Embedding(len(word_map)+1, 300, weights=[embedding_matrix], input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    print(model.summary())
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, Y_train, epochs=100, verbose=2)
    return model
And here is the full error log from the server:
File "/home2/slp24/thesis/UpdatedLanguageModel_7_31.py", line 335, in create_model_2
model.fit(X_train, Y_train, batch_size=32, epochs=1, verbose=2) ## prev X, y
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/models.py", line 963, in fit
validation_steps=validation_steps)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/engine/training.py", line 1682, in fit
self._make_train_function()
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/engine/training.py", line 990, in _make_train_function
loss=self.total_loss)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/opt/python-3.4.1/lib/python3.4/site-packages/keras/optimizers.py", line 466, in get_updates
m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 898, in binary_op_wrapper
y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 932, in convert_to_tensor
as_ref=False)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1022, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 100, in _IndexedSlicesToTensor
value.values, value.indices, value.dense_shape[0], name=name)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5186, in unsorted_segment_sum
num_segments=num_segments, name=name)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/opt/python-3.4.1/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[845246,300] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: training/Adam/mul_2/y = UnsortedSegmentSum[T=DT_FLOAT, Tindices=DT_INT32, Tnumsegments=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/embedding_1/Gather_grad/Reshape, training/Adam/gradients/embedding_1/Gather_grad/Reshape_1/_101, training/Adam/mul_2/strided_slice)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Edit:
So far I have tried:
Adding batching, starting with batch_size=32
I am currently working to decrease the number of output classes from 845,286. I think something went wrong when I calculated the custom embedding matrix, specifically when I was "connecting" the vocabulary token indices assigned during preprocessing and the y_categorical values assigned by Keras that the model uses...
Any help or guidance is greatly appreciated! I have searched many similar issues but have not been able to apply those fixes to my code thus far. Thank you
You're exceeding the memory size of your GPU.
You can:
Train/predict with smaller batches
Or, if even a batch_size=1 is too much, you need a model with fewer parameters.
Hint: the first dimension of that tensor (845246) is really, really big. Is that the correct length?
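To put a number on that hint, here is a back-of-the-envelope calculation for the tensor named in the error. It assumes 4 bytes per element (float32, which is what "type float" in the log normally means); the helper tensor_bytes is made up for this sketch.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


embedding_bytes = tensor_bytes((845246, 300))
embedding_gib = embedding_bytes / 2**30
print(embedding_bytes, round(embedding_gib, 2))
```

That is roughly 0.94 GiB for a single copy. Adam keeps two moment estimates per weight, and the traceback shows the embedding gradient being converted to a dense tensor of the same shape, so several arrays of this size may be alive at once regardless of the batch size.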
I had the same problem with a Google Colab GPU.
The batch size was 64 when this error appeared; after I reduced the batch size to 32, it worked properly.

Memory error when initializing Xception using Keras

I am having difficulty implementing the pre-trained Xception model for binary classification over a new set of classes. The model is successfully returned from the following function:
#adapted from:
#https://github.com/fchollet/keras/issues/4465
from keras.applications.xception import Xception
from keras.layers import Input, Flatten, Dense
from keras.models import Model
def get_xception(in_shape,trn_conv):
    #Get back the convolutional part of Xception trained on ImageNet
    model = Xception(weights='imagenet', include_top=False)
    #Here the input images have been resized to 299x299x3, so this is the
    #same as Xception's native input
    input = Input(in_shape,name = 'image_input')
    #Use the generated model
    output = model(input)
    #Only train the top fully connected layers (keep pre-trained feature extractors)
    for layer in model.layers:
        layer.trainable = False
    #Add the fully-connected layers
    x = Flatten(name='flatten')(output)
    x = Dense(2048, activation='relu', name='fc1')(x)
    x = Dense(2048, activation='relu', name='fc2')(x)
    x = Dense(2, activation='softmax', name='predictions')(x)
    #Create your own model (Keras 2 uses the keyword names inputs/outputs)
    my_model = Model(inputs=input, outputs=x)
    my_model.compile(loss='binary_crossentropy', optimizer='SGD')
    return my_model
This returns fine, however when I run this code:
model=get_xception(shp,trn_feat)
in_data=HDF5Matrix(str_trn,'/inputs')
labels=HDF5Matrix(str_trn,'/labels')
model.fit(in_data,labels,shuffle="batch")
I get the following error:
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/engine/training.py", line 1576, in fit
self._make_train_function()
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/engine/training.py", line 960, in _make_train_function
loss=self.total_loss)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/keras/optimizers.py", line 169, in get_updates
v = self.momentum * m - lr * g # velocity
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 705, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 865, in binary_op_wrapper
return func(x, y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1088, in _mul_dispatch
return gen_math_ops._mul(x, y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1449, in _mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tsmith/.virtualenvs/keras/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[204800,2048]
[[Node: training/SGD/mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](SGD/momentum/read, training/SGD/Variable/read)]]
I have been tracing the function calls for hours now and still can't figure out what is happening. The system should be far above and beyond the requirements. System specs:
Ubuntu Version: 14.04.5 LTS
Tensorflow Version: 1.3.0
Keras Version: 2.0.7
28x dual core Intel Xeon processor (1.2 GHz)
4x NVidia GeForce 1080 (8 GB memory each)
Any clues as to what is going wrong here?
Per Yu-Yang, the simplest solution was to reduce the batch size; everything ran fine after that!
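For context on where shape[204800,2048] comes from, here is a short calculation. It assumes that Xception without its top turns a 299x299 image into a 10x10x2048 feature map and that weights are float32; the global-average-pooling variant at the end is a hypothetical alternative, not something proposed in the thread.

```python
# Assumed: Xception without its top maps a 299x299 image to a 10x10x2048 feature map.
feature_map = (10, 10, 2048)
dense_units = 2048
bytes_per_float32 = 4

# Flatten turns the feature map into one long vector per image.
flattened = feature_map[0] * feature_map[1] * feature_map[2]
fc1_kernel_shape = (flattened, dense_units)
fc1_kernel_bytes = flattened * dense_units * bytes_per_float32

# Hypothetical alternative: global average pooling leaves only the 2048 channels.
pooled_kernel_bytes = feature_map[2] * dense_units * bytes_per_float32

print(fc1_kernel_shape, fc1_kernel_bytes / 2**30, pooled_kernel_bytes / 2**20)
```

The fc1 kernel alone takes about 1.56 GiB, and the optimizer line in the traceback allocates a second tensor of the same shape, so little of an 8 GB card is left for activations. That may be why a smaller batch was enough to get under the limit, even though the kernel size itself does not depend on the batch size.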