How to reset stateful optimizer without `model.compile`? - tensorflow

I have tried resetting the state of the optimizer by re-compiling the model. Then I found that it could lead to a memory leak.
Reproducible code:
import tensorflow as tf
import numpy as np
import objgraph
m = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(20,)),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(10),
])
objgraph.show_growth()
for i in range(100):
    tf.keras.backend.clear_session()
    m.compile('adam', loss='mse')
    data = np.arange(32 * 20).reshape(32, 20)
    labels = np.zeros(32)
    results = m.fit(data, labels, epochs=10)
objgraph.show_growth()
I want to reset the state of the optimizer before each fit(), but the optimizer should keep its state during each fit(). How can I achieve that without re-compiling the model?
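For illustration, here is a minimal sketch of one way to express the goal without re-compiling. It assumes a TF 2.x-style tf.keras optimizer such as Adam, whose variables() method returns the iteration counter and slot variables; m is the model from the code above:
def reset_optimizer_state(optimizer):
    # Zero every variable the optimizer owns: the iteration counter and the
    # per-weight slots (e.g. Adam's m and v accumulators).
    for var in optimizer.variables():
        var.assign(tf.zeros_like(var))

data = np.arange(32 * 20).reshape(32, 20)
labels = np.zeros(32)
m.compile('adam', loss='mse')                    # compile once
for i in range(100):
    reset_optimizer_state(m.optimizer)           # fresh optimizer state for this fit()
    results = m.fit(data, labels, epochs=10)     # state is kept within each fit()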

Related

LSTM training error is very high and relatively unchanging

As a learning exercise, I'm trying to use an LSTM model with the Keras framework to predict the stock market based on multiple data points. The size of my input array is roughly [5000, 100]. Based on other questions on this site and articles online, the approach seems fairly standard: put the data in a numpy array, scale it, reshape it to 3 dimensions for the LSTM, split it into train and test sections, and feed it through the model. Running only the training portion of the model, I am consistently getting loss scores around 400,000,000. This is not changed by altering the batch size, the number of epochs, the number of layers, replacing the normalization with dropout layers, changing the sizes of each layer, or using different optimizers and loss functions. Any idea why the loss is so high and what I can do to fix that? Attached is the code. All advice is greatly appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Model, preprocessing
from keras.utils import plot_model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler(feature_range=(0, 1))
features_df = pd.read_csv("dataset.csv")
features_np = np.array(features_df)
features_np.astype(np.float64)
scaler.fit_transform(features_np)
num_features=features_np.shape[1]
features = np.reshape(features_np, (features_np.shape[0], 1, features_np.shape[1]))
labels_np = np.array(pd.read_csv("output.csv"))
scaler.fit_transform(labels_np)
test_in = features_np[int(features_np.shape[0] * 0.75):]
test_in = np.reshape(test_in, (test_in.shape[0], 1, test_in.shape[1]))
test_out = labels_np[int(labels_np.shape[0] * 0.75):]
test_out = np.reshape(test_out, (test_out.shape[0], 1, test_out.shape[1]))
inputs = layers.Input(shape=(1, features.shape[2]))
x = layers.LSTM(5000, return_sequences=True)(inputs)
lstm1 = layers.LSTM(1000, return_sequences=True)(x)
norm1 = layers.BatchNormalization()(lstm1)
lstm2 = layers.LSTM(1000, return_sequences=True)(norm1)
lstm3 = layers.LSTM(1000, return_sequences=True)(lstm2)
norm2 = layers.BatchNormalization()(lstm3)
lstm4 = layers.LSTM(1000, return_sequences=True)(norm2)
lstm5 = layers.LSTM(1000)(lstm4)
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='sigmoid')(dense1)
outputs = layers.Dense(2)(dense2)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(features, labels_np, epochs=1, batch_size=4)
evaluate = model.evaluate(test_in, test_out, verbose=2)
While I have not solved the error, implementing the Sequential() model and using only two LSTM layers and a Dense layer changed the error: the training error is now very low while testing remains high. This now appears to be a (relatively) simple problem of overfitting rather than the more confusing error of high training loss. Hopefully, this helps anyone having a similar problem.
There are two things I notice and don't understand why you use them. The first is the dense2 layer with a sigmoid activation. I don't think a sigmoid activation is beneficial when we are trying to solve a regression problem. Can you change that to relu and see what happens? The second is your final Dense layer, which has two units. You did not specify it, but I think you are predicting two values from the same inputs. If you are trying to predict just one value, you should change that to
outputs = layers.Dense(1)(dense2)
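Both suggestions together would look like this (a sketch reusing the question's layer names, not the asker's original code):
dense2 = layers.Dense(1000, activation='relu')(dense1)   # relu instead of sigmoid
outputs = layers.Dense(1)(dense2)                         # a single regression output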

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with a softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Below is my code; true_p are the true parameters I use to generate the data and would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category, and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0, 1, 0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
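For concreteness, here is a sketch of the first suggestion applied to the question's model. The nll definition is my assumption of the usual negative log-likelihood loss, since the question imports it from a local module:
def nll(y_true, y_pred):
    # Assumed definition: negative log-likelihood of the labels under the
    # distribution produced by the DistributionLambda layer.
    return -y_pred.log_prob(y_true)

# No softmax here: the Dense layer outputs raw logits, which is what
# tfd.Categorical expects as its first (default) argument.
logit_layer = layers.Dense(len(true_p))(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(logit_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
The learned probabilities can then be recovered by applying a softmax to the learned logits.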

CUDA|RAM runs out of memory using Keras model.fit_generator

I have video data input of shape (300, 226, 226, 3) in channels-last configuration, and my output is (300, 1), both stored as NumPy arrays. I don't want to load all the data at once, as it is around 120 GB. My code is pretty simple:
import os
import sys
from random import shuffle
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.layers import (BatchNormalization, Dense, Flatten, Input,
                          MaxPooling3D, TimeDistributed)
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.models import Model, Sequential
from keras.utils import plot_model
from model import My_ConvLSTM_Model
def generate_arrays(available_ids):
    datar = pd.read_csv("C:/Users/muzaf/Documents/GitHub/Data_mining/data.csv")
    while True:
        for i in available_ids:
            name_ext = str(datar.iat[i, 0])
            name = os.path.basename((os.path.splitext(name_ext))[0])
            scene = np.load('D:/Webcam/Input/{}.npy'.format(name))
            category = np.load('output/{}.npy'.format(name))
            yield (np.array([scene]), category[0])
available_ids = [i for i in range(1, 20)]
shuffle(available_ids)
final_train_id = int(len(available_ids)*0.8)
train_ids = available_ids[:final_train_id]
val_ids = available_ids[final_train_id:]
frames = 300
pixels_x = 226
pixels_y = 226
channels = 3
seq = Sequential()
seq.add(ConvLSTM2D(filters=20, kernel_size=(3, 3),
                   input_shape=(None, pixels_x, pixels_y, channels),
                   padding='same', data_format='channels_last', return_sequences=True))
seq.add(BatchNormalization())
seq.add(MaxPooling3D(pool_size=(2, 2, 1), strides=None,
                     padding='valid', data_format='channels_last'))
seq.add(TimeDistributed(Flatten()))
seq.add(TimeDistributed(Dense(32)))
seq.add(TimeDistributed(Dense(1, activation='relu')))
seq.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
print(seq.summary())
history = seq.fit_generator(
    generate_arrays(train_ids), steps_per_epoch=len(train_ids),
    validation_data=generate_arrays(val_ids),
    validation_steps=len(val_ids),
    epochs=100, verbose=1, shuffle=False, initial_epoch=0)
As soon as I run this, my GPU (GTX 1060, 6 GB) memory fills up, and so does my RAM. Am I doing something wrong here?
Try to reduce steps_per_epoch and validation_steps. Putting all the data in at once will blow up your memory. It's like defining the batch size.
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of samples of your dataset divided by the batch size. Optional for Sequence: if unspecified, will use the len(generator) as a number of steps.
validation_steps: Only relevant if validation_data is a generator. Total number of steps (batches of samples) to yield from generator before stopping. Optional for Sequence: if unspecified, will use the len(validation_data) as a number of steps.
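As a worked example of the quoted definitions (the numbers here are hypothetical, not the question's data):
num_train_samples = 1000      # hypothetical number of training samples
num_val_samples = 200         # hypothetical number of validation samples
batch_size = 8                # samples yielded per generator call
steps_per_epoch = num_train_samples // batch_size     # 125 batches per epoch
validation_steps = num_val_samples // batch_size      # 25 validation batches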
Batch size in deep learning refers to the number of training examples used in one iteration.
You have given a very large batch size of 300. Just reduce the batch to a smaller number such as 8, 16, or 32, which is the usual practice in deep learning experiments. A large batch often leads to GPU out-of-memory errors because that much memory won't be available to process a large batch of images.
One more thing that can lead to out-of-memory situations is other processes running in the background. Run nvidia-smi and see whether there are any processes running on the GPU; if so, check how much memory is available.
Hope this will help you.
First, make sure no browser applications such as Chrome or Firefox are running. Then turn on a GPU-monitoring tool to watch memory utilization while tuning the batch-size parameters. If that still doesn't work, try reducing the size of the training data.

Can't import frozen graph with BatchNorm layer

I have trained a Keras model based on this repo.
After the training I save the model as checkpoint files like this:
sess=tf.keras.backend.get_session()
saver = tf.train.Saver()
saver.save(sess, current_run_path + '/checkpoint_files/model_{}.ckpt'.format(date))
Then I restore the graph from the checkpoint files and freeze it using the standard tf freeze_graph script. When I want to restore the frozen graph I get the following error:
Input 0 of node Conv_BN_1/cond/ReadVariableOp/Switch was passed float from Conv_BN_1/gamma:0 incompatible with expected resource
How can I fix this issue?
Edit: My problem is related to this question. Unfortunately, I can't use the workaround.
Edit 2:
I have opened an issue on github and created a gist to reproduce the error.
https://github.com/keras-team/keras/issues/11032
I just resolved the same issue. I connected these few answers (1, 2, 3) and realized that the issue originated from the batch norm layer's working state: training or inference. So, in order to resolve the issue, you just need to place one line before loading your model:
keras.backend.set_learning_phase(0)
Complete example to export a model:
import tensorflow as tf
from tensorflow.python.framework import graph_io
from tensorflow.keras.applications.inception_v3 import InceptionV3
def freeze_graph(graph, session, output):
    with graph.as_default():
        graphdef_inf = tf.graph_util.remove_training_nodes(graph.as_graph_def())
        graphdef_frozen = tf.graph_util.convert_variables_to_constants(session, graphdef_inf, output)
        graph_io.write_graph(graphdef_frozen, ".", "frozen_model.pb", as_text=False)
tf.keras.backend.set_learning_phase(0) # this line most important
base_model = InceptionV3()
session = tf.keras.backend.get_session()
INPUT_NODE = base_model.inputs[0].op.name
OUTPUT_NODE = base_model.outputs[0].op.name
freeze_graph(session.graph, session, [out.op.name for out in base_model.outputs])
To load the *.pb model:
from PIL import Image
import numpy as np
import tensorflow as tf
# https://i.imgur.com/tvOB18o.jpg
im = Image.open("/home/chichivica/Pictures/eagle.jpg").resize((299, 299), Image.BICUBIC)
im = np.array(im) / 255.0
im = im[None, ...]
graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    net_inp, net_out = tf.import_graph_def(
        graph_def, return_elements=["input_1", "predictions/Softmax"]
    )
with tf.Session(graph=graph) as sess:
    out = sess.run(net_out.outputs[0], feed_dict={net_inp.outputs[0]: im})
    print(np.argmax(out))
This is a bug with TensorFlow 1.1x and, as another answer stated, it is because of the internal batch norm learning vs. inference state. In TF 1.14.0 you actually get a cryptic error when trying to freeze a batch norm layer.
Using set_learning_phase(0) will put the batch norm layer (and probably others like dropout) into inference mode and thus the batch norm layer will not work during training, leading to reduced accuracy.
My solution is this:
Create the model using a function (do not use K.set_learning_phase(0)):
def create_model():
    inputs = Input(...)
    ...
    return model
model = create_model()
Train model
Save weights:
model.save_weights("weights.h5")
Clear session (important so layer names are the same) and set learning phase to 0:
K.clear_session()
K.set_learning_phase(0)
Recreate model and load weights:
model = create_model()
model.load_weights("weights.h5")
Freeze as before
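Putting those steps together, a rough sketch (freeze_graph refers to the helper defined in the earlier answer, and x_train/y_train are hypothetical training data):
import tensorflow as tf
from tensorflow.keras import backend as K

model = create_model()
model.fit(x_train, y_train)                  # 2. train the model
model.save_weights("weights.h5")             # 3. save weights

K.clear_session()                            # 4. clear session so layer names stay the same
K.set_learning_phase(0)                      #    and switch to inference mode

model = create_model()                       # 5. recreate the model and load weights
model.load_weights("weights.h5")

session = K.get_session()                    # 6. freeze as before
freeze_graph(session.graph, session, [out.op.name for out in model.outputs])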
Thanks for pointing out the main issue! I found that keras.backend.set_learning_phase(0) does not always work, at least in my case.
Another approach might be: for l in keras_model.layers: l.trainable = False

How to read the top of a Queue multiple times before dequeueing in tensorflow

In the following example, every time I run sess.run([image, label]), a different sample from the queue is returned, and thus a different np_image is returned.
Is there a way that I can let the slim.queues.QueueRunners know that I want to use (run) the same sample multiple times before a dequeue operation takes place?
The reason I ask is that I have a large op that doesn't fit in my VRAM. I have to break the large op into several small ops and feed a different feed_dict every time a small op is run. However, when I run a small op, image changes, which breaks the code. Putting all the small ops in a list and running the list at the same time doesn't work for me because the VRAM size is the limitation.
Thanks!
import tensorflow as tf
import numpy as np
slim = tf.contrib.slim
from datasets import dataset_utils
from tensorflow.python.ops import control_flow_ops
from datasets import dataset_factory
from deployment import model_deploy
from nets import nets_factory
from preprocessing import preprocessing_factory
with tf.Graph().as_default():
    dataset = dataset_factory.get_dataset('cifar10', 'train', '/home/user/dataset/cifar10')
    provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=1,
        common_queue_capacity=256,
        common_queue_min=128)
    [image, label] = provider.get(['image', 'label'])
    image_preprocessing_fn = preprocessing_factory.get_preprocessing(
        'cifarnet',
        is_training=True)
    images, labels = tf.train.batch([image, label],
                                    batch_size=32,
                                    num_threads=1,
                                    capacity=64)
    with tf.Session() as sess:
        with slim.queues.QueueRunners(sess):
            for i in range(3):
                # in every iteration, the tensor 'image' will be different
                # the np_image value will be different as well
                np_image, np_label = sess.run([image, label])
A peek operation for queues is currently not supported; for discussion see
https://github.com/tensorflow/tensorflow/issues/7880
A work-around is to restructure your code to take values from tf.Variable objects rather than from tf.dequeue, i.e. something like this:
x = tf.Variable(queue.dequeue())
y = x+2
sess.run(x.initializer)
sess.run(y)
sess.run(y) # same value
sess.run(x.initializer)
sess.run(y) # new value
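Applied to the question's pipeline, a sketch of how that work-around could slot in (image and label are the tensors returned by provider.get; the explicit refresh op is my own addition):
# Cache one dequeued sample in variables and rerun ops against the cached values.
image_var = tf.Variable(image, trainable=False)
label_var = tf.Variable(label, trainable=False)
refresh = tf.group(image_var.initializer, label_var.initializer)

with tf.Session() as sess:
    with slim.queues.QueueRunners(sess):
        sess.run(refresh)                                            # pull one sample from the queue
        np_image_1, np_label_1 = sess.run([image_var, label_var])
        np_image_2, np_label_2 = sess.run([image_var, label_var])   # same sample again
        sess.run(refresh)                                            # advance to the next sample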